Closed by morphine00 8 months ago
Additional comment: we noticed that the two scripts that download a sequence of TAR files also download a checksum file and verify it. That verification reports a lot of errors because it tries to check many associated files that weren't downloaded (e.g. the JSON metadata). Was this the intended behavior?
This is expected, although it could have been handled better.
There are two types of metadata files: Parquet files, which contain the original image links, and JSON files, which are generated by img2dataset and record information about the download process, such as the error code when a file fails to download.
The meta files were uploaded along with the dataset (you can view them here), but I didn't include them in the download scripts to save on bandwidth. Changing the validation command from:
sha512sum --quiet -c sha512sums.txt
to
grep '\.tar$' sha512sums.txt | sha512sum --quiet -c
should get rid of the warnings, since only the entries for the downloaded TAR files are checked.
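For anyone wiring this into their own copy of the download script, here is a minimal sketch of the filtered check wrapped in a function. The function name `verify_tars` is hypothetical, and the sketch assumes the checksum file uses the usual `sha512sum` line format (`<hash>  <filename>`) with archive names ending in `.tar`; it is not taken from the repository's scripts.

```shell
# Verify only the TAR archives listed in a checksum file.
# Assumption: each line looks like "<hash>  <filename>", as written by sha512sum.
verify_tars() {
    sums_file="${1:-sha512sums.txt}"
    # Keep only entries whose filename ends in ".tar". The escaped dot and the
    # "$" anchor avoid matching names that merely contain the letters "tar".
    grep '\.tar$' "$sums_file" | sha512sum --quiet -c -
}
```

Run it from the directory that holds the downloaded archives, e.g. `verify_tars sha512sums.txt`. The exit status is that of `sha512sum`, so it is non-zero if any listed archive is missing or corrupted, and also if the file contains no `.tar` entries at all.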
Changing that line is trivial. That said, we checked: the dataset itself is many gigabytes, while the support files are only a few MB each. Perhaps it would be simpler to have the download scripts grab the entire directories?
bump