pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[proposal] Move dataset tar/zip extraction and integrity checking of multiple files to utils.py #441

Open vadimkantorov opened 6 years ago

vadimkantorov commented 6 years ago

Datasets such as CIFAR, MNIST, etc. have download and integrity-checking logic that would be useful when implementing custom user datasets (such as https://github.com/vadimkantorov/metriclearningbench/blob/master/cars196.py, https://github.com/vadimkantorov/metriclearningbench/blob/master/cub2011.py, https://github.com/vadimkantorov/metriclearningbench/blob/master/stanford_online_products.py).

I propose moving this logic into functions in torchvision/datasets/utils.py: downloading and extracting tarballs, zip files, and plain files, and checking the integrity of a list of files by MD5 when checksums are provided.

Currently, avoiding duplication leads to quirky subclassing of ImageFolder, Cifar10, etc.
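To make the multi-file integrity check in the proposal concrete, here is a minimal sketch using only the standard library. The names `file_md5` and `check_integrity_multiple`, and the `(relative_path, md5)` list format, are assumptions made for illustration and are not existing torchvision API.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def check_integrity_multiple(root, file_list):
    """Hypothetical helper, not part of torchvision.

    `file_list` is a list of (relative_path, expected_md5) pairs, where
    expected_md5 may be None to only require that the file exists.
    Returns True only if every file exists and every given MD5 matches.
    """
    for rel_path, expected_md5 in file_list:
        path = os.path.join(root, rel_path)
        if not os.path.isfile(path):
            return False
        if expected_md5 is not None and file_md5(path) != expected_md5:
            return False
    return True
```

A dataset could then call `check_integrity_multiple(root, file_list)` once in its constructor instead of writing its own loop over files.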

fmassa commented 6 years ago

I agree with this. I think the only reason why we haven't done this yet is that each dataset has its own particularity. But a clean and concise function that handles all the above cases would be awesome to have!

reynoldscem commented 6 years ago

I think it would be good to come up with a fixed list of functionality which should be factored out into utils so this can be tackled as a fixed-scope piece of work rather than a general refactor. Happy to do so once it's established.

fmassa commented 6 years ago

I had a quick look at it, and I have the impression that most of the basic functionality, namely downloading and integrity checking, is already present in utils.py.

The current integrity checking/downloading code that we have in each dataset is very minimal and specific to each dataset, so it's unclear to me if we can further factor it out without making these functions overly complex.

vadimkantorov commented 6 years ago

Currently, utils.py does not include functionality to extract a tarball or a zip archive, or to check the integrity of multiple files, as is routinely done in datasets (boilerplate loops are needed).

Many Dataset implementations consist largely of this archive extraction logic, so even though it is a small number of lines, it easily takes up 60% of the code. Feel free to take a look at https://github.com/vadimkantorov/metriclearningbench/blob/master/cars196.py, https://github.com/vadimkantorov/metriclearningbench/blob/master/cub2011.py, https://github.com/vadimkantorov/metriclearningbench/blob/master/stanford_online_products.py
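As an illustration of the missing extraction step, a single helper built on the standard `tarfile` and `zipfile` modules might look roughly like this. The name `extract_archive` and its signature are assumptions for this sketch, not something utils.py provides in this discussion.

```python
import os
import tarfile
import zipfile


def extract_archive(archive_path, dest_dir):
    """Hypothetical helper: extract a tar (optionally compressed) or zip archive.

    The format is detected from the file contents rather than the extension.
    Note: extractall() trusts the member paths, so this sketch should only be
    used on archives from a trusted source.
    """
    os.makedirs(dest_dir, exist_ok=True)
    if tarfile.is_tarfile(archive_path):
        # "r:*" lets tarfile pick the compression (gz, bz2, xz, or none).
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(dest_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest_dir)
    else:
        raise ValueError("Unsupported archive format: {}".format(archive_path))
    return dest_dir
```

With a helper like this, a dataset's download method would shrink to a download call, an `extract_archive(...)` call, and an integrity check.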

fmassa commented 6 years ago

I'm not sure it's worth the added complexity. But I'm happy to see a proposal that keeps the simplicity of the current functions while allowing those cases to be handled.

emilmelnikov commented 6 years ago

I'd like to make a somewhat similar proposal: implement dataset downloading through the command line, something like the following:

# Download MNIST into "datadir"
python -m torchvision.datasets MNIST datadir
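One way the command above could be wired up is sketched below with `argparse`. This is only an illustration of the idea: torchvision is not stated here to ship such an entry point, `build_parser` and `main` are hypothetical names, and the sketch assumes the chosen dataset class accepts `(root, download=True)`, as MNIST does.

```python
import argparse


def build_parser():
    """Build the parser for the hypothetical `python -m torchvision.datasets` CLI."""
    parser = argparse.ArgumentParser(
        prog="python -m torchvision.datasets",
        description="Download a torchvision dataset into a directory (hypothetical CLI).",
    )
    parser.add_argument("dataset", help="dataset class name, e.g. MNIST")
    parser.add_argument("root", help="directory to download the dataset into")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported lazily so the parser itself does not require torchvision.
    import torchvision.datasets as datasets

    dataset_cls = getattr(datasets, args.dataset, None)
    if dataset_cls is None:
        raise SystemExit("Unknown dataset: {}".format(args.dataset))
    # Assumption: the constructor takes (root, download=True); datasets with
    # other required arguments would need extra handling.
    dataset_cls(args.root, download=True)
```

Placed in a `__main__.py` inside the package and followed by a call to `main()`, this would make `python -m torchvision.datasets MNIST datadir` download MNIST into `datadir`.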