jckantor / nbpages

Command line tool to maintain a repository of Jupyter notebooks.
https://jckantor.github.io/nbpages/
MIT License

Critical: error with loading data files in Colab #47

Closed adowling2 closed 3 years ago

adowling2 commented 4 years ago

Take a look at this notebook. Open it in Colab. https://ndcbe.github.io/cbe67701-uncertainty-quantification/01.01-Contributed-Example.html

Running this line:

stock_data = pd.read_csv('./data/Stock_Data.csv')

Gave the following error:

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

<ipython-input-4-4072e866c75c> in <module>()
----> 1 stock_data = pd.read_csv('./data/Stock_Data.csv')

4 frames

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File ./data/Stock_Data.csv does not exist: './data/Stock_Data.csv'

Here is the problem: When writing to HTML, we need to point to the text file in GitHub.

@jckantor I would like to share this notebook with the class on Tuesday.

jckantor commented 4 years ago

Yes, that's a problem.

I'm wondering about the fix. Rewriting the link by prepending the github url would correctly handle both './data/Stock_data.csv' and 'data/Stock_data.csv'. I think os.path.join(github_pages_url, data_file) might do the trick. But what about writes? This approach would also need to distinguish file reads from file writes, since we can't have colab write back to github pages. Since there are multiple ways to write a file, like

```python
open(fname, 'w').write(content)
print(content, file=f)
with open(fname, 'w') as f: f.write(content)
df.to_csv(fname)
```

it's going to be difficult to catch all of these and correctly distinguish writes from reads. There's also a similar problem with files appearing in the figures subdirectory. This is nasty because figures in ./figures that show up on the html pages will not show up on colab. And there's the same issue of writing figures.

How about adding one more button alongside 'download' and 'open in colab' called 'download .zip data/figure files'? So you push that button and presto, your local directory now includes the ./data and ./figure directories.

jckantor commented 4 years ago

Another idea. For each notebook, create a zip file containing a directory named after the notebook. The directory holds that notebook plus data and figures subdirectories with the data and figures used in that notebook. So the user action would be download, unzip, and then move to that directory. While that's a few more clicks, we can present the user with a complete notebook. We could even include a code subdirectory for python code distributed with the notebook.

Not sure exactly how to handle this in Colab.

jckantor commented 4 years ago

Still one more idea. Add a code cell to the notebook header that creates local data/figure/code subdirectories, then copies required content from github repository. This has the advantage of working the same regardless of anaconda or colab. The only change in UX is to execute one additional cell (if needed).

jckantor commented 4 years ago

I've added some comments to the issue, and am thinking about this one. It's tricky, and I'm not sure anyone has a great answer for how to organize notebooks that will be run in different environments. One practice I've read about is to place each notebook in a separate directory with its associated files and subdirectories. That would address this issue, but requires some clumsy zip/download/unzip/change-directory actions.

Another thought, since we're doing some notebook rewriting anyways, is to add a code cell at the top of a published notebook that does the necessary downloads from github to local subdirectories. Below is an example of what that might look like for the notebook that's giving you trouble.

What you could do for Tuesday is just paste this cell near the top of the notebook as part of the routine imports. In the meantime, I'll try to think through how this might be integrated into nbpages.

Jeff

```python
import os, requests

# github pages url with terminal /
url = "https://ndcbe.github.io/cbe67701-uncertainty-quantification/"

# create local subdirectories
for d in ("data", "figures"):
    if not os.path.exists(d):
        os.mkdir(d)
    assert os.path.isdir(d), f"directory {d} is not available"

# download files to local subdirectories
download_files = ['data/Stock_Data.csv']
for file in download_files:
    r = requests.get(url + file)
    with open(file, 'wb') as f:
        f.write(r.content)
```

jckantor commented 4 years ago

Even better ...

```python
import os, requests, urllib.parse, urllib.request

# github pages url with terminal /
url = "https://ndcbe.github.io/cbe67701-uncertainty-quantification/"

# relative file paths to download
file_paths = ['data/Stock_Data.csv']

for file_path in file_paths:
    stem, filename = os.path.split(file_path)
    if stem and not os.path.exists(stem):
        os.mkdir(stem)
    if not os.path.isfile(file_path):
        with open(file_path, 'wb') as f:
            f.write(requests.get(urllib.parse.urljoin(url, urllib.request.pathname2url(file_path))).content)
```

adowling2 commented 4 years ago

I like the concept. I will give this a try.

What do you think about creating an nbpages library? It could include functions to:

I am just not sure the best way to do this. One idea would be for nbpages to copy nbpages.py from itself to /docs/. But how could we then import the library when using colab?
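One way the import question above could work in Colab, sketched here under the assumption that the library is a single .py file published alongside the notebooks. The file name and contents below are hypothetical stand-ins; in practice the file would be downloaded from the GitHub Pages site rather than written locally.

```python
import importlib
import pathlib
import sys

# Stand-in for a single-file library (e.g. an nbpages.py copied to /docs/).
# In practice this file would be fetched with requests from the GitHub Pages
# site; here we just write it locally to show the import step.
pathlib.Path("nbpages_utils.py").write_text(
    "def hello():\n"
    "    return 'imported in colab'\n"
)

sys.path.insert(0, ".")  # ensure the working directory is importable
nbpu = importlib.import_module("nbpages_utils")
print(nbpu.hello())  # prints: imported in colab
```

Once the file sits next to the notebook, Colab imports it like any local module, so the whole dance reduces to one download plus one import.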

adowling2 commented 4 years ago

I implemented your solution and I can verify it works. Check it out: https://ndcbe.github.io/cbe67701-uncertainty-quantification/01.01-Contributed-Example.html

What do you think about an nbpages utilities library? I know you did a library in your controls class. How did that work with Colab?

adowling2 commented 4 years ago

@jckantor More specifically, how did you get the CBE30338 library to work? https://nbviewer.jupyter.org/github/jckantor/CBE30338/blob/master/notebooks/A.01-Python-Library-for-CBE30338.ipynb

Is this a PyPi package? Does it work with Colab or would one need to add a line to pip install it when running from Colab?

jckantor commented 4 years ago

I've played with several ideas for a library. Pypi is a solid alternative and makes it very easy to distribute and install code, including for Colab via pip. On the other hand, it works best if you have the discipline to package things carefully. That's what I was trying to do with CBE30338 out of the same repo as the notebooks.

Then there are the simple examples and utility functions that one wants to distribute quickly and with minimal overhead. For that, a code subdirectory in parallel with data and figures might be enough. So, two ideas. They're not mutually exclusive, but I don't have a firm idea of which would be the better approach for most use cases.

adowling2 commented 4 years ago

My suggestion would be to stand up a separate repo for a package nbpagesutilities or something similar. This would be an extremely lightweight library with useful functions to streamline use with Colab:

We would then have nbpages write into each notebook a few lines of code to install nbpagesutilities. Then nbpages would determine the files used in the notebook and construct a call to nbpu.getfile().

Why the separate repo? We want to keep this super lightweight. It will also be easier to document. I don't want students to get overwhelmed with the documentation for nbpages.

Thoughts?

Bonus: You won't need to duplicate code to install Pyomo into each notebook. And if it changes, we just need to update nbpagesutilities.
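A minimal sketch of what the proposed nbpu.getfile() helper might look like. The function name comes from this thread, but the signature and default base url are assumptions:

```python
import os
from urllib.request import urlretrieve

# Hypothetical getfile() helper for the proposed nbpagesutilities package.
# The default base_url is the course site discussed in this thread.
def getfile(file_path, base_url="https://ndcbe.github.io/cbe67701-uncertainty-quantification/"):
    """Download file_path from the GitHub Pages site unless already present."""
    stem = os.path.dirname(file_path)
    if stem:
        os.makedirs(stem, exist_ok=True)  # create data/, figures/, etc. as needed
    if not os.path.isfile(file_path):
        urlretrieve(base_url + file_path, file_path)
    return file_path
```

With something like this, the generated cell in each notebook shrinks to an install line plus one getfile() call per data file.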

adowling2 commented 4 years ago

I am going to open a separate issue for figures. I think those can be handled separately with some extra code in nbpages when rewriting the notebook.

jckantor commented 4 years ago

Take a look at https://jckantor.github.io/nbpages/02.04-Working-with-Data-and-Figures.html

The first code cell was generated and inserted by nbpages. When run, it downloads (if needed) from the docs repo any data files that appear in the notebook. This should run on colab, vocareum, anaconda, or any other platform allowing requests.

I'm tempted to add figures to the same mechanism. It would add no additional lines of code to the target notebook.

Let me know what you think of this solution. It's pretty generic with minimal intervention by the author or student. The aesthetic look of the cell leaves a bit to be desired, though.

adowling2 commented 4 years ago

I implemented a similar solution here yesterday: https://ndcbe.github.io/cbe67701-uncertainty-quantification/01.01-Contributed-Example.html#1.1.5.1-Download-the-data-into-Colab

My thought is to package this into a lightweight nbpagesutilities package. That means one less piece of code for students to worry about or accidentally change. I would also put the code to install extra packages in the same spot. Why do this? That way, if say Pyomo changes how they install, you can just fix it in one place and all of your repos will work.

I am happy to write the code and help test. I just need help creating a Pypi package.

jckantor commented 4 years ago

Here's the basic outline of packaging for pypi ... https://packaging.python.org/tutorials/packaging-projects/#uploading-your-project-to-pypi. I actually have that done for nbpages. It would be easy to just add a /util module which would be imported as nbpages.util.

The new functionality for data files does a scan of all data files, locates them in notebooks, then, as needed for each notebook, automatically inserts the required import code. So this is totally automatic. What might be useful is considering the use case where a course developer wants to add a course specific code module.

I think the generic class to write would be for ancillary files stored in notebook subdirectories. So far we have figures and data. Adding a code directory would make a lot of sense as a means to share code across notebooks.
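The scan step described above might look something like this sketch, which searches a notebook's code cells for quoted paths under data/. The regex is an assumption about what counts as a data-file reference, not the actual nbpages implementation:

```python
import json
import re

# Collect quoted references to files under data/ from a notebook's code cells.
def find_data_files(notebook_path):
    with open(notebook_path) as f:
        nb = json.load(f)
    pattern = re.compile(r"['\"](?:\./)?(data/[\w.\- ]+)['\"]")
    found = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        if isinstance(src, list):  # nbformat stores source as a list of lines
            src = "".join(src)
        found.update(pattern.findall(src))
    return sorted(found)
```

The same scan generalizes to figures/ and code/ by widening the pattern, which is what makes a single ancillary-files mechanism plausible.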

jckantor commented 4 years ago

Thinking about this some more (while riding the lawn tractor this afternoon), what we really have here is the generic issue of managing the resources needed to use a notebook. Those resources include data files, image files, code, and libraries that may be local or remote. And what we want to do is create a portable means for a notebook to designate what resources are required, and provide a means to get them.

So maybe for a later version of nbpages we can collapse the data/figures/... subdirectory scheme into a more generic resources directory, and the various indices into a generic resource index. It would simplify the code structure and logic of the whole system.
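The resource-index idea could be as simple as one mapping from local paths to remote sources plus a single fetch routine. The index entries, file names, and base url below are illustrative assumptions:

```python
import os
from urllib.request import urlretrieve

# Illustrative resource index: local path -> remote source.
BASE_URL = "https://ndcbe.github.io/cbe67701-uncertainty-quantification/"
RESOURCES = {
    "data/Stock_Data.csv": BASE_URL + "data/Stock_Data.csv",
    "figures/example.png": BASE_URL + "figures/example.png",  # hypothetical entry
}

def fetch_resources(resources=RESOURCES):
    """Create local subdirectories and download any resource not already present."""
    for local_path, remote_url in resources.items():
        stem = os.path.dirname(local_path)
        if stem:
            os.makedirs(stem, exist_ok=True)
        if not os.path.isfile(local_path):
            urlretrieve(remote_url, local_path)
```

One index covering data, figures, and code would let nbpages generate a single, uniform setup cell regardless of what kind of resource the notebook uses.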

adowling2 commented 4 years ago

I really like the idea of supporting a /src/ directory to contain custom class modules. This would be much cleaner than creating a module for each class.

Here are my arguments for creating separate packages for nbpages and nbpagesutil:

But I also see the convenience of just maintaining one package. I'll leave the decision to you.

But overall, I think a package with utilities is the way to go. This keeps the code hidden from (most) users, which is good. Another case to consider: some students will develop in Colab. They'll want to first upload a file to GitHub, then manually use the package to download it to their Colab space.

How can I help make the package happen?