tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0
4.29k stars 1.54k forks

DownloadManager with register checksum is much slower #2901

Open adsnaider opened 3 years ago

adsnaider commented 3 years ago

Short description

When using the DownloadManager to download many small files (1M+ images), the download is reasonably fast if register_checksums is disabled. With register_checksums enabled, it is painfully slow: the difference is multiple orders of magnitude. I'm doing this with a non-Beam dataset. I'm unsure whether this has something to do with how the downloads are parallelized. The documentation says that if the dl_manager receives a data structure to download, it will parallelize the downloads. Does parallelization not work when register_checksums is enabled? If so, it would at least be nice to update the documentation to clarify this.

Environment information

Reproduction instructions

I'm using dl_manager to download files from S3. To reproduce the issue, compare the speed of downloading many small files from S3, once with register_checksums enabled and once with it disabled. In my case, the dataset is upwards of 70GB, but I don't believe it needs to be that large: a couple of GB will probably be enough.

Expected behavior

I expected the download speed not to change so drastically because of checksum registration.

Conchylicultor commented 3 years ago

Thank you for reporting. The download manager records the checksum for each downloaded file sequentially. This was done to avoid redownloading all files, even if the generation script crashes. We could try to improve this. Another issue is that your final checksum file will contain checksums for 1M URLs, so it can be quite big. The workaround for now would be to not use checksums at all.
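To make the trade-off concrete, here is a rough, standalone sketch of why recording checksums one file at a time can dominate the run time. It is not TFDS code: `fake_download`, `record_checksums`, and the `flush_every` parameter are hypothetical names, and the sketch assumes the checksum registry is rewritten in full every time it is persisted. Under that assumption, persisting after every file writes an amount of data that grows quadratically with the number of URLs, while persisting in batches keeps most of the crash safety at a fraction of the cost.

```python
import hashlib
import json
import os
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str) -> bytes:
    """Hypothetical stand-in for a real fetch; returns fixed bytes per URL."""
    return url.encode("utf-8") * 16


def record_checksums(urls, registry_path, flush_every=1, workers=8):
    """Fetch `urls` in parallel and persist {url: [size, sha256]} as JSON.

    flush_every=1 rewrites the whole registry after every file, so nothing
    is lost on a crash, but the total bytes written grow quadratically.
    Larger values rewrite less often and risk losing at most one batch.
    Returns how many times the registry file was written.
    """
    registry = {}
    writes = 0

    def flush():
        nonlocal writes
        tmp_path = registry_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(registry, f)
        # Swap the finished file into place so readers never see a partial one.
        os.replace(tmp_path, registry_path)
        writes += 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Downloads run in worker threads; checksums are recorded only here,
        # in the calling thread, in input order.
        results = zip(urls, pool.map(fake_download, urls))
        for count, (url, data) in enumerate(results, start=1):
            registry[url] = [len(data), hashlib.sha256(data).hexdigest()]
            if count % flush_every == 0:
                flush()

    if len(registry) % flush_every != 0:
        flush()  # persist the trailing partial batch
    return writes
```

Calling `record_checksums(urls, "checksums.json", flush_every=1)` on a list of 1M URLs would rewrite the registry 1M times in this model, whereas `flush_every=1000` would rewrite it 1,000 times and produce the same final file.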