Read COCO dataset from ZIP file?

pytorch / vision

Datasets, Transforms and Models specific to Computer Vision

https://pytorch.org/vision

BSD 3-Clause "New" or "Revised" License


Read COCO dataset from ZIP file? #947

Open koenvandesande opened 5 years ago

koenvandesande commented 5 years ago

For large datasets on, e.g., university clusters, where your data storage is an NFS mount, reading individual files can be slow. Reading many small files this way also doesn't support reading ahead. In the cloud, you typically have SSD storage, but unzipping the dataset still takes time.

Would you be open to receiving a pull request that reads the COCO dataset from its zipped version? It adds around 10 lines in the COCO Detection class, and adds another Python file for reading ZIP files in a fork-safe manner (so it works with distributed training).
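A minimal sketch of what a fork-safe ZIP reader could look like, for illustration only. It is not the code from the proposed pull request: `ForkSafeZipReader` and its methods are hypothetical names, and the approach shown (reopening the archive whenever the process ID changes, so that forked workers never share a file handle) is an assumption about how fork safety might be achieved.

```python
import os
import tempfile
import zipfile


class ForkSafeZipReader:
    """Hypothetical helper: each process lazily opens its own ZipFile handle."""

    def __init__(self, path):
        self.path = path
        self._zip = None   # opened on first read, never in __init__
        self._pid = None   # PID of the process that opened self._zip

    def _handle(self):
        pid = os.getpid()
        # After a fork the child inherits the parent's handle (and its
        # shared file offset), so reopen if the PID has changed.
        if self._zip is None or self._pid != pid:
            self._zip = zipfile.ZipFile(self.path, "r")
            self._pid = pid
        return self._zip

    def read(self, name):
        """Return the raw bytes of one archive member."""
        return self._handle().read(name)

    def __getstate__(self):
        # Drop the open handle when the reader is pickled for a worker.
        return {"path": self.path, "_zip": None, "_pid": None}


# Small demonstration on a throwaway archive (placeholder bytes, not a real image).
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "images.zip")
with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("train/0001.jpg", b"fake image bytes")

reader = ForkSafeZipReader(archive_path)
data = reader.read("train/0001.jpg")
```

A dataset class could hold one such reader and call `reader.read(...)` from its `__getitem__`, so that each data-loading worker ends up with its own handle.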

fmassa commented 5 years ago

You mean that all the images are in a zip file? And how would the reading work? Does it unzip everything locally, or read from the zipped file without uncompressing it all?

In general, I don't see why this would be something specific to the COCO dataset. But finding a generic way of supporting this for all datasets is something that would be great to have.

koenvandesande commented 5 years ago

Yes, all the images are in a ZIP file and they are read without unzipping, with the constraint (added by me) that the ZIP file shouldn't use compression (which is the case for COCO). Note that ZIP files are suited for this because they have an index. Tar files are not very efficient for this, because you need to walk over the entire file first to build an index. I'll first create something just for COCO, and then we can look at which other datasets are stored as ZIP files.
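To illustrate the idea with Python's standard `zipfile` module (a sketch, not the proposed implementation): the index lets a single member be looked up by name, and the "no compression" constraint can be checked per entry. The helper name `assert_stored` and the member names are made up for this example.

```python
import io
import zipfile


def assert_stored(zf):
    """Raise ValueError if any member of the open ZipFile is compressed."""
    compressed = [info.filename for info in zf.infolist()
                  if info.compress_type != zipfile.ZIP_STORED]
    if compressed:
        raise ValueError(f"compressed members found: {compressed[:3]}")


# Build a small uncompressed archive in memory (placeholder bytes, not real images).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("val/a.jpg", b"aaaa")
    zf.writestr("val/b.jpg", b"bbbbbb")

with zipfile.ZipFile(buffer, "r") as zf:
    assert_stored(zf)
    # getinfo() looks the member up in the index read when the archive was
    # opened; earlier members do not have to be scanned to find it.
    info = zf.getinfo("val/b.jpg")
    payload = zf.read("val/b.jpg")
```

The bytes in `payload` could then be handed to an image decoder in the same way as the contents of a file read from disk.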

koenvandesande commented 5 years ago

This could easily apply to the following datasets as well (because they are stored as ZIP files):

celeba
omniglot
phototour (though not really, because it does postprocessing on the files after extraction)