Closed melissawm closed 3 years ago
> I don't think we want to keep the dataset locally.
Agreed - that won't scale as more tutorials+data sources are added. Curating external data sources will be an important improvement for the tutorials. There are things like git LFS, but the storage and bandwidth quotas are pretty stringent, at least for the free/OSS accounts.
> Are there alternatives for getting this dataset online?
Hi @melissawm Thanks for bringing this up. How about this: https://pypi.org/project/mnist/ (https://github.com/datapythonista/mnist)? It still relies on http://yann.lecun.com/exdb/mnist/, but it may fix/bypass the CI error:
```python
import mnist

train_images = mnist.train_images()
train_labels = mnist.train_labels()
test_images = mnist.test_images()
test_labels = mnist.test_labels()
```
This could also save a lot of lines of code.
Alternatively, we could create/find a GitHub repo that already contains the dataset and load the files from there (change the URL to https://github.com/{some_repo}/{mnist_dataset_location}).
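If we go the GitHub-mirror route, the tutorial would need to parse the raw IDX files itself. A minimal sketch of that, assuming a hypothetical `parse_idx` helper and with the mirror URL left as the placeholder from the comment above (the actual download is commented out since no mirror exists yet):

```python
import gzip
import struct
import urllib.request

import numpy as np


def parse_idx(data):
    """Parse the IDX binary format that the MNIST files use.

    Big-endian header: two zero bytes, a type code (0x08 = unsigned
    byte), the number of dimensions, then one 4-byte size per dimension,
    followed by the raw array values.
    """
    _zero, _dtype_code, ndim = struct.unpack(">HBB", data[:4])
    shape = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    return np.frombuffer(data, dtype=np.uint8, offset=4 + 4 * ndim).reshape(shape)


# Placeholder URL from the suggestion above -- substitute a real mirror:
# url = "https://raw.githubusercontent.com/{some_repo}/{mnist_dataset_location}/train-images-idx3-ubyte.gz"
# train_images = parse_idx(gzip.decompress(urllib.request.urlopen(url).read()))
```

This keeps the tutorial free of extra dependencies, at the cost of re-implementing the format parsing that the `mnist` package already does.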
> Agreed - that won't scale as more tutorials+data sources are added. Curating external data sources will be an important improvement for the tutorials.

> I don't think we want to keep the dataset locally.
🤔 Those are interesting and good points @melissawm @rossbar. Do you mean the current CI tests require downloading the dataset onto GitHub's VM or somewhere else, and you're running out of space? Sorry if I misunderstood the issue.
@melissawm Do you know if there is a similar issue with the Pong tutorial? In that example, a "self-made" dataset is created through game observations (frames of the game)—and it's never the same dataset—before the images of the gameplay are preprocessed and fed through a neural net policy.
> Do you mean the current CI tests require downloading the dataset onto GitHub's VM or somewhere else, and you're running out of space?
It's not a space issue but a server rate-limiting issue. Typically, servers that host data for download impose limits (on total bandwidth, number of requests per IP, etc.) to prevent requesters from consuming an inordinate amount of the host's resources. We're clearly exceeding those limits for the current data source. The solution is to either find or host the data ourselves somewhere with sufficient capacity to handle the number of requests we expect to see (which includes CI runs, users running the tutorial via Binder, etc.).
We probably can (and should) get around some of the load from CI by caching downloaded data.
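One way to cache the downloads, as a minimal sketch: a fetch helper that only hits the network when the file isn't already on disk. The `fetch_cached` name and cache location are hypothetical, and a real CI setup would additionally need to persist the cache directory between runs (e.g. via the CI provider's caching mechanism):

```python
import os
import urllib.request


def fetch_cached(url, cache_dir="~/.cache/numpy-tutorials"):
    """Download ``url`` once; later calls reuse the local copy."""
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

Repeated tutorial executions (and re-runs of the same CI job) would then issue at most one request per file to the upstream server.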
How about adding a dependency on scikit-learn here and doing:
```python
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original', data_home='/location/to/download/to/')
```
There's no point reinventing this particular wheel.
> We probably can (and should) get around some of the load from CI by caching downloaded data.
Yes, that's necessary for all large datasets.
Cool, thanks @melissawm @rossbar @rgommers 👍 💯
In the interest of science 🔬 do you mind if I use `import mnist` (https://pypi.org/project/mnist/) first and see if this returns the same error from the server that hosts http://yann.lecun.com/exdb/mnist/?
I've always wanted to keep the tutorial free of any ML-framework—that was the reason why I modified the original code, which used Keras to download MNIST from http://yann.lecun.com/exdb/mnist/ with `keras.datasets.mnist()` (currently, it's `tf.keras.datasets.mnist.load_data()`).
And, if that doesn't work, we can use scikit-learn, since its solution loads the data from https://www.openml.org/d/554. (Please note that `sklearn.datasets.fetch_mldata()` is deprecated as of v0.20 (and deleted in v0.24), so we should probably instruct users to install v0.20 or later and use `sklearn.datasets.fetch_openml('mnist_784', version=1, ...)`.)
Assuming the necessary data is available, I prefer using scikit-learn as it is a well-established member of the ecosystem.
@rossbar True that. Also, it'd be parsing the data from https://www.openml.org instead of the "lower-powered" server.
Update after the NumPy meeting: the ARFF file format is unfavorable, so we're back to the original dataset source. See https://github.com/numpy/numpy-tutorials/pull/66#issue-588441383 (by @rossbar)
We have been getting an error in the CI for the MNIST tutorial and I just figured out the reason: we are getting a `403 - Forbidden` when we try to download the datasets from the website listed in the tutorial. Checking that website, I got a message:

I don't think we want to keep the dataset locally. Are there alternatives for getting this dataset online? @8bitmp3 do you have any thoughts here?