Open adsnaider opened 3 years ago
Thank you for reporting. The download manager records the checksum for each downloaded file sequentially. This was done to avoid redownloading all files if the generation script crashes. We could try to improve this. Another issue is that your final checksum file will contain checksums for 1M URLs, so it can be quite big. The workaround for now would be to not use checksums at all.
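To make the trade-off in this comment concrete, here is a minimal sketch in plain Python. It is not TFDS code: the function names and the JSON file format are hypothetical, and it assumes that each "record" step rewrites the whole checksum file, which the comment does not confirm. Under that assumption, recording after every file is crash-safe but writes a number of entries that grows quadratically with the number of files, while recording once at the end writes each entry once but loses everything on a crash.

```python
import json
from typing import Dict, List, Tuple


def save_checksums(path: str, checksums: Dict[str, str]) -> None:
    """Rewrite the whole checksum file (hypothetical JSON format)."""
    with open(path, "w") as f:
        json.dump(checksums, f)


def record_after_each(path: str, downloads: List[Tuple[str, str]]) -> int:
    """Persist after every download; return the total number of entries written."""
    recorded: Dict[str, str] = {}
    entries_written = 0
    for url, checksum in downloads:
        recorded[url] = checksum
        # Crash-safe: the file on disk always reflects every finished download.
        save_checksums(path, recorded)
        entries_written += len(recorded)
    return entries_written


def record_once(path: str, downloads: List[Tuple[str, str]]) -> int:
    """Persist a single time at the end; return the number of entries written."""
    recorded = dict(downloads)
    # Cheap, but nothing is saved if the script crashes before this point.
    save_checksums(path, recorded)
    return len(recorded)
```

With n files, the first strategy writes n(n+1)/2 entries in total. For 1M URLs that is about 5×10^11 entries, which would be consistent with a slowdown of several orders of magnitude if the assumption holds.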
Short description
When using the DownloadManager to download many small files (1M+ images), the download is relatively fast if register_checksums is disabled. However, if register_checksums is enabled, the download is painfully slow. We are talking about a difference of multiple orders of magnitude. I'm doing this with a non-Beam dataset. I'm unsure whether this has something to do with the parallelization of the downloads. The documentation says that if the dl_manager receives a data structure to download, it will parallelize the downloads. Does parallelization not work when register_checksums is enabled? If so, at the very least it would be nice to update the documentation to clarify this.
Environment information
Operating System: Ubuntu 20.04
Python version: 3.8.5
tensorflow-datasets version: 4.1.0
tensorflow version: 2.3.1
Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)? Yes
Reproduction instructions
I'm using dl_manager to download files from S3. To reproduce this issue, compare the speed of downloading many small files from S3, once with register_checksums enabled and once with it disabled. In my case, the dataset is upwards of 70GB, but I don't believe it needs to be that large: a couple of GB will probably be enough.
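The comparison described above could be timed with a small harness like the following. This is only a sketch: `compare_download_speed` is a hypothetical helper, and the TFDS usage in the trailing comments assumes a custom dataset registered under the made-up name `my_dataset`. Each run needs a fresh data directory, otherwise the second run may reuse files cached by the first.

```python
import time
from typing import Callable, Dict


def compare_download_speed(download: Callable[[bool], None]) -> Dict[bool, float]:
    """Time `download(register_checksums)` with the flag off, then on.

    Returns a dict mapping the flag value to elapsed wall-clock seconds.
    """
    timings: Dict[bool, float] = {}
    for register_checksums in (False, True):
        start = time.perf_counter()
        download(register_checksums)
        timings[register_checksums] = time.perf_counter() - start
    return timings


# Possible usage with TFDS (not executed here; `my_dataset` is hypothetical):
#
#   import tempfile
#   import tensorflow_datasets as tfds
#
#   def download(register_checksums: bool) -> None:
#       # Fresh directory per run so nothing is served from a previous cache.
#       builder = tfds.builder("my_dataset", data_dir=tempfile.mkdtemp())
#       config = tfds.download.DownloadConfig(register_checksums=register_checksums)
#       builder.download_and_prepare(download_config=config)
#
#   print(compare_download_speed(download))
```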
Expected behavior
I expected the download speed not to change so drastically due to checksum registration.