pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Torchvision dataset mirrors #7637

Open dhruvbird opened 1 year ago

dhruvbird commented 1 year ago

🚀 The feature

Is it possible for pytorch/torchvision to mirror all the datasets on its own domain/hosts instead of downloading them from the original researchers' web pages/URLs?

Motivation, pitch

More often than not I run into problems when downloading them. For example:

  1. Too many downloads
  2. Bandwidth limit exceeded for the day
  3. Some other outage such as in https://github.com/pytorch/vision/issues/7545

Also when running a Kaggle notebook, it re-downloads every time since there's no way to cache the downloaded dataset.

Mirroring would make the problems above (and others) go away.

More often than not, people work around these issues by using an existing copy of the dataset that someone has uploaded to Kaggle and defining their own Dataset class to read from it. Alternatively, people use "hacks" to make torchvision read from an existing Kaggle dataset whose directory layout (and name) isn't what torchvision expects. See https://www.kaggle.com/code/dhruv4930/starter-for-oxford-iiit-pet-using-torchvision for an example.

Code copied below.

# Oxford IIIT Pets segmentation dataset loaded via torchvision from a pre-uploaded Kaggle dataset.
import torchvision

# Symlink the read-only Kaggle input into the working directory under the
# directory name ('oxford-iiit-pet') that torchvision's OxfordIIITPet class expects.
!rm -f '/kaggle/working/oxford-iiit-pet'
!ln -s '/kaggle/input/oxfordiiitpetfromxijiatao/Oxford-IIT-Pet' '/kaggle/working/oxford-iiit-pet'

# With download=False, torchvision reads the already-present files instead of fetching them.
oxford_pets_path = '/kaggle/working'
pets_train_orig = torchvision.datasets.OxfordIIITPet(root=oxford_pets_path, split="trainval", target_types="segmentation", download=False)
pets_test_orig = torchvision.datasets.OxfordIIITPet(root=oxford_pets_path, split="test", target_types="segmentation", download=False)

Alternatives

Since I'm personally interested in solving my local problem for Kaggle notebooks, a viable alternative would be to create a Kaggle dataset for every torchvision dataset. Then, when working in Kaggle, I could simply attach it - and using a Kaggle dataset is also more reliable inside Kaggle notebooks.

However, this is a myopic view of the problem and provides a localized solution to a localized problem. I'm fairly sure others outside the narrow scope of Kaggle notebooks have run into this issue, and the mirroring suggested above would be a more holistic, broadly applicable solution.

I'm open to other solutions that work across environments.

Additional context

Thanks for working on torchvision - it's saved me a lot of time on mundane and vision-specific tasks!

abhi-glitchhg commented 1 year ago

I like the proposal, but won't the dataset licenses be a problem here?

NicolasHug commented 1 year ago

Thanks for the proposal @dhruvbird. Unfortunately, as @abhi-glitchhg correctly noted above, hosting the datasets ourselves isn't an option because of licensing / copyright issues. Some datasets plainly don't allow it. That's why our datasets are merely wrappers over the original URLs: we provide a convenient way to download them, but we don't (and cannot) host anything.

dhruvbird commented 1 year ago

IIUC, most datasets allow non-commercial and educational/research-related use. I'm not proposing munging or changing them in any way - just mirroring them with a TTL (say, 1 day) so that the mirror is refreshed from the original site every day (for example). I think everyone should be happy with this setup, and the author(s) would probably pay less for bandwidth (for popular datasets). I'm not a lawyer though, so as you mentioned, it's best to check with legal on this.

oke-aditya commented 1 year ago

The only dataset that torchvision mirrors is MNIST. This was done because MNIST is so popular that downtime caused issues.

https://github.com/pytorch/vision/blob/main/torchvision/datasets/mnist.py#L36

But note that even this required sorting out licensing and copyright.
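
For reference, MNIST's mirror list is just a class attribute of base URLs that the downloader tries in order, so it can also be extended from user code. A minimal sketch (the extra mirror URL is hypothetical):

import torchvision

# MNIST.mirrors is a list of base URLs; the downloader tries each one in turn
# for every file it needs (train/test images and labels).
print(torchvision.datasets.MNIST.mirrors)

# Prepend a mirror of your own (hypothetical URL) so it is tried before the defaults.
torchvision.datasets.MNIST.mirrors = ["https://my-mirror.example.org/mnist/"] + torchvision.datasets.MNIST.mirrors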

dhruvbird commented 1 year ago

One way to work around this might be to set up a caching HTTP reverse proxy and have all the URLs in the torchvision code point to it, with a fallback to the original URL in case the proxy is unavailable. The proxy can then be configured to cache HTTP GET requests, etc. Just a thought - this would significantly improve my (and others') experience when using these datasets!
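
As a rough sketch of that fallback idea at the call site, assuming a hypothetical caching proxy at https://datasets-cache.example.org that mirrors upstream URLs by path (the proxy host and the helper below are illustrative, not part of torchvision):

from urllib.error import URLError

from torchvision.datasets.utils import download_url

PROXY = "https://datasets-cache.example.org"  # hypothetical caching reverse proxy


def download_with_fallback(url, root, filename=None, md5=None):
    # Try the caching proxy first; fall back to the original URL if the proxy fails.
    proxied = PROXY + "/" + url.split("://", 1)[1]
    try:
        download_url(proxied, root, filename=filename, md5=md5)
    except (URLError, RuntimeError, OSError):
        download_url(url, root, filename=filename, md5=md5)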

seraph9000 commented 2 weeks ago

Also when running a Kaggle notebook, it re-downloads every time since there's no way to cache the downloaded dataset.

I don't believe hosting the dataset would fix this issue; the Kaggle environment would still need to download the dataset from somewhere each time it starts up.

If you're worried about bandwidth or outages, you could upload the dataset to your own bucket and set your own download URLs like so:

torchvision.datasets.MNIST.mirrors = ['https://your-bucket.cloud.com/mnist/']  # a base URL; the individual MNIST files are fetched from under it

or for other datasets:

torchvision.datasets.CIFAR10.url = 'https://your-bucket.cloud.com/cifar10.tar.gz'

Of course, this all assumes you have the licensing / copyright rights to host these datasets.
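
Here's a short usage sketch of the CIFAR10 override (the bucket URL is illustrative, and the hosted archive must be byte-identical to the original since torchvision still verifies its md5 checksum):

import torchvision

# Point the class at your own copy of the archive before constructing the dataset.
# The file is saved locally under CIFAR10's expected filename and checked against
# the built-in md5, so it must be the same archive as the original.
torchvision.datasets.CIFAR10.url = 'https://your-bucket.cloud.com/cifar10.tar.gz'

ds = torchvision.datasets.CIFAR10(root='./data', train=True, download=True)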