numpy / numpy-tutorials

NumPy tutorials & educational content in notebook format
https://numpy.org/numpy-tutorials/

MNIST dataset can't be downloaded automatically #63

Closed melissawm closed 3 years ago

melissawm commented 3 years ago

We have been getting an error in the CI for the MNIST tutorial and I just figured out the reason: we are getting a 403 - Forbidden when we try to download the datasets from the website listed in the tutorial. Checking that website, I got this message:

Please refrain from accessing these files from automated scripts with high frequency. Make copies!

I don't think we want to keep the dataset locally. Are there alternatives for getting this dataset online? @8bitmp3 do you have any thoughts here?

rossbar commented 3 years ago

I don't think we want to keep the dataset locally.

Agreed - that won't scale as more tutorials+data sources are added. Curating external data sources will be an important improvement for the tutorials. There are things like git LFS, but the storage and bandwidth quotas are pretty stringent, at least for the free/OSS accounts.

8bitmp3 commented 3 years ago

Are there alternatives for getting this dataset online?

Hi @melissawm Thanks for bringing this up. How about this? https://pypi.org/project/mnist/ (https://github.com/datapythonista/mnist) It still relies on http://yann.lecun.com/exdb/mnist/, but it may fix/bypass the CI error?

import mnist

train_images = mnist.train_images()
train_labels = mnist.train_labels()

test_images = mnist.test_images()
test_labels = mnist.test_labels()

This could also save a lot of lines of code.

Alternatively, we could create/find a GitHub repo that already contains the dataset and load the files from there (change the URL to https://github.com/{some_repo}/{mnist_dataset_location}).
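If we go the raw-files route, the downloaded files would still need to be decoded by hand. A minimal sketch of a parser for the IDX binary layout the MNIST files use (the `parse_idx` name is made up; the header layout is from the format description on the MNIST page):

```python
import struct

import numpy as np


def parse_idx(data: bytes) -> np.ndarray:
    """Decode an IDX-format byte string into a NumPy array.

    Header layout: two zero bytes, a dtype code, the number of
    dimensions, then one big-endian uint32 per dimension; the raw
    array data follows immediately after.
    """
    zeros, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zeros != 0:
        raise ValueError("not an IDX file")
    if dtype_code != 0x08:  # 0x08 = unsigned byte; MNIST only uses this
        raise ValueError(f"unsupported dtype code: {dtype_code:#x}")
    dims = struct.unpack(">" + "I" * ndim, data[4 : 4 + 4 * ndim])
    return np.frombuffer(data, dtype=np.uint8, offset=4 + 4 * ndim).reshape(dims)
```

The image files would decompress (they ship gzipped) to a 3-D array of shape `(num_images, 28, 28)` and the label files to a 1-D array, regardless of which mirror hosts them.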

Agreed - that won't scale as more tutorials+data sources are added. Curating external data sources will be an important improvement for the tutorials.

I don't think we want to keep the dataset locally.

🤔 Those are interesting and good points @melissawm @rossbar. Do you mean the current CI tests require downloading the dataset onto GitHub's VM (or somewhere else) and you're running out of space? Sorry if I misunderstood the issue.

@melissawm Do you know if there is a similar issue with the Pong tutorial? In that example, a "self-made" dataset is created through game observations (frames of the game)—and it's never the same dataset—before the images of the gameplay are preprocessed and fed through a neural net policy.

rossbar commented 3 years ago

Do you mean the current CI tests require to download the dataset somewhere onto GitHub's VM or somewhere else and you're running out space?

It's not a space issue, but a server rate-limiting issue. Typically, servers that host data for downloading have limits (based on total bandwidth, number of requests per IP, etc.) to prevent requesters from eating up an inordinate amount of resources from the host. We're clearly exceeding that for the current data source. The solution is to either find or host the data ourselves somewhere with sufficient capacity to handle the number of requests we expect to see (which includes CI runs + users running the tutorial via Binder, etc.).

We probably can (and should) get around some of the load from CI by caching downloaded data.

rgommers commented 3 years ago

How about adding a dependency on scikit-learn here and doing:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', data_home='/location/to/download/to/')

There's no point reinventing this particular wheel.

We probably can (and should) get around some of the load from CI by caching downloaded data.

yes that's necessary for all large datasets

8bitmp3 commented 3 years ago

Cool, thanks @melissawm @rossbar @rgommers 👍 💯

In the interest of science 🔬 do you mind if I use the import mnist (https://pypi.org/project/mnist/) first and see if this returns the same error from the server that hosts http://yann.lecun.com/exdb/mnist/?

I've always wanted to keep the tutorial free of any ML framework; that was the reason why I modified the original code, which used Keras to download MNIST from http://yann.lecun.com/exdb/mnist/ with keras.datasets.mnist() (currently, it's tf.keras.datasets.mnist.load_data()).

And, if that doesn't work, we can use scikit-learn, since its solution loads the data from https://www.openml.org/d/554. (Please note that sklearn.datasets.fetch_mldata() is deprecated as of v0.20 (and removed in v0.24), so we should probably instruct users to install v0.20 or later and use sklearn.datasets.fetch_openml('mnist_784', version=1, ...).)
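For reference, the fetch_openml route could look something like this (the `load_mnist` wrapper and the `_data` cache directory are illustrative, not from the tutorial; scikit-learn caches the download under data_home, so only the first call goes to the network):

```python
from sklearn.datasets import fetch_openml


def load_mnist(data_home: str = "_data"):
    """Fetch MNIST from OpenML (https://www.openml.org/d/554)."""
    mnist = fetch_openml("mnist_784", version=1, data_home=data_home)
    X, y = mnist.data, mnist.target
    # The OpenML copy stores the conventional split back to back:
    # the first 60,000 rows are training data, the last 10,000 test data.
    return (X[:60000], y[:60000]), (X[60000:], y[60000:])
```

Each row is a flattened 784-pixel image, so a `reshape(-1, 28, 28)` would recover the image grid the tutorial plots.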

rossbar commented 3 years ago

Assuming the necessary data is available, I prefer using scikit-learn as it is a well-established member of the ecosystem.

8bitmp3 commented 3 years ago

@rossbar True that. Also, it'd be parsing the data from https://www.openml.org instead of the "lower-powered" server.

[image]

Update after the NumPy meeting: the ARFF file format is unfavorable, so we're back to the original dataset source. See https://github.com/numpy/numpy-tutorials/pull/66#issue-588441383 (by @rossbar)