pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Incorrect processing of redirects using download_url() #3235

Closed slipnitskaya closed 3 years ago

slipnitskaya commented 3 years ago

🐛 Bug

torchvision.datasets.utils.download_url() processes redirects incorrectly. An attempt to download from a URL that responds with redirect headers fails and leaves an empty file behind.

To Reproduce

The behavior can be reproduced using the link to the CUB-200-2011 dataset. Here is an example:

from torchvision.datasets.utils import download_url

url = 'http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz'
root = 'data'
download_url(url, root)  # fails: leaves an empty file in root

Expected behavior

The file should be downloaded correctly from the given URL.
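
One possible way to get the expected behavior is to resolve the redirect chain first and only then download from the final URL. The following is a minimal sketch using only the standard library, not the torchvision implementation. The names resolve_redirects, probe_final_url, max_hops and probe are hypothetical and were chosen for illustration.

```python
import urllib.request
from typing import Callable


def probe_final_url(url: str) -> str:
    """Send a HEAD request and return the URL that finally answered.

    urllib follows HTTP 3xx redirects for GET and HEAD on its own,
    so response.geturl() reports where the request ended up.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.geturl()


def resolve_redirects(
    url: str,
    max_hops: int = 3,
    probe: Callable[[str], str] = probe_final_url,
) -> str:
    """Probe repeatedly until the URL stops changing.

    `probe` is injectable (an assumption made for this sketch) so the
    loop can be exercised without network access.
    """
    for _ in range(max_hops + 1):
        resolved = probe(url)
        if resolved == url:
            # No further redirect: this is the URL to download from.
            return url
        url = resolved
    raise RuntimeError(f"more than {max_hops} redirects; last URL was {url}")
```

A caller would then pass resolve_redirects(url) to the downloader instead of the original url. The hop limit guards against redirect loops.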

Environment

cc @pmeier

datumbox commented 3 years ago

@slipnitskaya Thanks for reporting. I don't think simply allowing redirects will be enough, because the website now hosts the dataset on Google Drive.

@pmeier Thoughts?

pmeier commented 3 years ago

I have never had to deal with redirects, so take what I say with a grain of salt.


@slipnitskaya I agree, handling redirects is a good addition and I will review your PR later on. That being said, I think @datumbox is right that following redirects won't solve your problem, since download_url does not work for files on Google Drive.

In order to achieve what you want, we could include a simple regular expression at the beginning of download_url that checks whether the URL points to Google Drive. If it does, download_url could dispatch the call to download_file_from_google_drive.
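
The dispatch idea described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code that was merged. The regular expression covers only two assumed Google Drive URL shapes (/file/d/<id> and ?id=<id>), and the names extract_gdrive_file_id and download_with_dispatch are hypothetical.

```python
import re
from typing import Optional

# Assumed Google Drive URL shapes; real links may use other forms.
_GDRIVE_PATTERN = re.compile(
    r"https?://drive\.google\.com/"
    r"(?:file/d/(?P<path_id>[\w-]+)"
    r"|(?:uc|open)\?(?:.*&)?id=(?P<query_id>[\w-]+))"
)


def extract_gdrive_file_id(url: str) -> Optional[str]:
    """Return the file id if `url` looks like a Google Drive link, else None."""
    match = _GDRIVE_PATTERN.match(url)
    if match is None:
        return None
    return match.group("path_id") or match.group("query_id")


def download_with_dispatch(url: str, root: str, filename: Optional[str] = None) -> None:
    """Hypothetical wrapper: route Google Drive links to the dedicated helper."""
    # Imported lazily so extract_gdrive_file_id works without torchvision installed.
    from torchvision.datasets.utils import (
        download_file_from_google_drive,
        download_url,
    )

    file_id = extract_gdrive_file_id(url)
    if file_id is not None:
        download_file_from_google_drive(file_id, root, filename)
    else:
        download_url(url, root, filename)
```

Keeping the id extraction in its own function makes the pattern easy to test and extend without touching the download logic.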

slipnitskaya commented 3 years ago

Thanks for reviewing!

@pmeier It would indeed be great to support dispatching download requests to Google Drive inside download_url. I could implement this feature so that it gets merged upstream sooner.

pmeier commented 3 years ago

Sounds good @slipnitskaya! I think we should split this into two separate PRs. You can simply ping me when it is ready for review.

slipnitskaya commented 3 years ago

@pmeier Dispatching of requests to Google Drive has been added to download_url() (PR #3245).

pmeier commented 3 years ago

Closed in #3236.