pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

torchvision.datasets.mnist has a broken url in mirrors to download the dataset #8568

Open cwestergren opened 3 months ago

cwestergren commented 3 months ago

🐛 Describe the bug

While attempting to download the MNIST dataset using torchvision.datasets.MNIST, I encountered an error that prevents the dataset from downloading successfully. The error indicates an issue with accessing one of the download URLs.

```python
import torchvision.datasets as datasets
from torch.utils.data import DataLoader

val_ds = datasets.MNIST(root='.', train=False, download=True)
val_dl = DataLoader(val_ds, batch_size=128, shuffle=True)
```

Expected Behavior

The MNIST dataset should be downloaded successfully without encountering any HTTP errors.

Actual Behavior

The download fails with a 403 Forbidden error when attempting to access http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz.

Observations

  1. Unencrypted HTTP Resource: The download is attempting to access a resource over HTTP instead of HTTPS, which may not be secure.
  2. 403 Forbidden Error: The server is returning a 403 Forbidden error, indicating that access to the resource is not allowed.

It's been this way for some time, so I suggest updating the list of mirrors in https://github.com/pytorch/vision/blob/main/torchvision/datasets/mnist.py so that it no longer points to an insecure/broken endpoint.

```python
mirrors = [
    "http://yann.lecun.com/exdb/mnist/",
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
]
```

Notably, trying to download the same files directly from @ylecun's page https://yann.lecun.com/exdb/mnist/index.html fails with the same error.
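As a rough illustration of the suggestion, the mirror list could be reordered so that HTTPS endpoints are tried before plain-HTTP ones. This is only a sketch: `prefer_https` is a hypothetical helper, not part of torchvision, and it does not check whether a mirror actually serves valid files.

```python
from urllib.parse import urlsplit


def prefer_https(mirrors):
    """Return HTTPS mirrors first, keeping the original relative order.

    Hypothetical helper for illustration; not a torchvision function.
    """
    secure = [m for m in mirrors if urlsplit(m).scheme == "https"]
    other = [m for m in mirrors if urlsplit(m).scheme != "https"]
    return secure + other


# The two mirrors quoted in this report.
mirrors = [
    "http://yann.lecun.com/exdb/mnist/",
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
]
reordered = prefer_https(mirrors)
print(reordered)
```

Assuming `mirrors` is still a class attribute of `MNIST` in the installed version, the reordered list could be assigned to `datasets.MNIST.mirrors` before constructing the dataset as a local workaround.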

Versions

PyTorch version: 2.3.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070
Nvidia driver version: 556.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=3696
DeviceID=CPU0
Family=207
L2CacheSize=2560
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=3696
Name=Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.1+cu118
[pip3] torchinfo==1.8.0
[pip3] torchvision==0.18.1+cu118
[pip3] torchviz==0.0.2
[conda] Could not collect

NicolasHug commented 3 weeks ago

Thanks for the report, @cwestergren. Yeah, sadly the official MNIST mirror from LeCun is down. And the existing alternative is down too (well, it leads to downloading empty files).
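For anyone wanting to tell a working mirror from one that serves empty files, a quick sanity check on the downloaded bytes might look like the sketch below. It assumes the files are gzip-compressed IDX files, whose first four bytes are a big-endian magic number (2051 for image files, 2049 for label files). `looks_like_mnist_archive` is a hypothetical name, and this is not how torchvision itself validates downloads.

```python
import gzip
import struct
import zlib

IDX_IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data, 3 dimensions
IDX_LABEL_MAGIC = 2049  # 0x00000801: unsigned-byte data, 1 dimension


def looks_like_mnist_archive(raw: bytes) -> bool:
    """Return True if `raw` is a gzip stream whose payload starts with an IDX magic number."""
    if not raw:
        return False  # empty download
    try:
        payload = gzip.decompress(raw)
    except (OSError, EOFError, zlib.error):
        return False  # not gzip (e.g. an HTML error page) or truncated
    if len(payload) < 4:
        return False
    (magic,) = struct.unpack(">I", payload[:4])
    return magic in (IDX_IMAGE_MAGIC, IDX_LABEL_MAGIC)


# Synthetic example: an IDX image header with zero images of 28x28 pixels.
fake_header = gzip.compress(struct.pack(">IIII", IDX_IMAGE_MAGIC, 0, 28, 28))
print(looks_like_mnist_archive(fake_header))
print(looks_like_mnist_archive(b""))
```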

We should try to add alternative mirrors from sources we can trust enough. Happy to consider suggestions.

SdgJlbl commented 3 weeks ago

Hello, we use torchvision to load MNIST in our quickstart example. Even having one of the two mirrors down is a problem for us, because it displays 403 Forbidden errors that confuse first-time users (see this Slack message for example). If you find a good alternative mirror, it may be worth deprecating the "official" one (or at least moving it down the list), since it has been down for more than 2 months now.
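One way to picture the behavior we would prefer is a fallback loop that tries each mirror in turn, stays quiet about intermediate failures, and only reports them if every mirror fails. The sketch below is an assumption-laden illustration, not torchvision's actual download code: `fetch_from_mirrors` and the injected `fetch` callable are hypothetical names, and the fake fetcher stands in for a real HTTP call.

```python
from typing import Callable, Iterable


def fetch_from_mirrors(
    filename: str,
    mirrors: Iterable[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try each mirror in order and return the first non-empty response."""
    failures = []
    for mirror in mirrors:
        url = mirror + filename
        try:
            data = fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            failures.append(f"{url}: {exc}")
            continue
        if not data:
            failures.append(f"{url}: empty response")
            continue
        return data
    # Surface the collected errors only when no mirror worked.
    raise RuntimeError(f"Could not download {filename}: " + "; ".join(failures))


def fake_fetch(url: str) -> bytes:
    """Stand-in fetcher: the plain-HTTP mirror is refused, the other succeeds."""
    if url.startswith("http://"):
        raise OSError("HTTP Error 403: Forbidden")
    return b"payload"


data = fetch_from_mirrors(
    "train-images-idx3-ubyte.gz",
    ["http://yann.lecun.com/exdb/mnist/", "https://ossci-datasets.s3.amazonaws.com/mnist/"],
    fake_fetch,
)
print(data)
```

With this shape, a first-time user would see no 403 message as long as at least one mirror in the list responds with data.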