koenvandesande opened 5 years ago
You mean that all the images are in a zip file?
And how would the reading work? Would it unzip everything locally, or read from the zipped file without decompressing it all?
In general, I don't see why this would be something specific to the COCO dataset. But finding a generic way of supporting this for all datasets is something that would be great to have.
Yes, all the images are in a zip file and they are read without unzipping, with the constraint (added by me) that the ZIP file shouldn't use compression (which is the case for COCO). Note that ZIP files are well suited for this because they have an index. For tar files this isn't efficient, because you first need to walk over the entire file to build an index. I'll first create something just for COCO, and then we can look at which other datasets are stored as ZIP files.
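A minimal sketch of what that random access looks like with Python's standard `zipfile` module (the archive and member names below are placeholders, not the actual PR code): the central directory gives you an index up front, and reading a `ZIP_STORED` entry is essentially a seek plus a read.

```python
import zipfile

from PIL import Image

# Open the archive once; ZipFile parses the central directory here,
# so we get an index of all members without scanning the whole file
# (this is the advantage over tar, which has no such index).
archive = zipfile.ZipFile("coco_train2017.zip", "r")
names = archive.namelist()

# Reading one member touches only that entry; if the archive uses
# ZIP_STORED (no compression), this is just a seek and a read.
with archive.open(names[0]) as f:
    img = Image.open(f)
    img.load()  # force the read while the member handle is still open
```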
This could easily apply to the following datasets as well (because they are stored as ZIP files):
For large datasets on e.g. university clusters, where your data storage is an NFS mount, reading individual files can be slow, and it doesn't support read-ahead. In the cloud, you typically have SSD storage, but unzipping the dataset still takes time.
Would you be open to receiving a pull request that reads the COCO dataset from its zipped version? It adds around 10 lines to the COCO Detection class, plus another Python file for reading ZIP files in a fork-safe manner (so it works with distributed training).
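For the fork-safe part, one common pattern (a sketch with hypothetical names, not necessarily what the PR does) is to re-open the archive lazily in each process: a `ZipFile` handle opened before `fork()` shares its underlying file descriptor and offset across DataLoader workers, which corrupts concurrent reads.

```python
import os
import zipfile


class ForkSafeZipReader:
    """Sketch of a fork-safe ZIP reader; class and method names are
    illustrative only."""

    def __init__(self, path):
        self.path = path
        self._pid = None
        self._zf = None

    def _ensure_open(self):
        # If this process was forked (e.g. a DataLoader worker), the
        # stored PID no longer matches, so open a fresh handle here
        # instead of sharing the parent's file descriptor.
        if self._zf is None or self._pid != os.getpid():
            self._zf = zipfile.ZipFile(self.path, "r")
            self._pid = os.getpid()

    def read(self, name):
        # Return the raw bytes of one archive member.
        self._ensure_open()
        with self._zf.open(name) as f:
            return f.read()
```

Checking the PID on each read, rather than relying on a worker-init hook, keeps the reader usable with any multiprocessing setup, not just PyTorch's DataLoader.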