mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Change dataset download scripts to use Cloudflare buckets directly #712

Closed: morphine00 closed this 3 months ago

github-actions[bot] commented 4 months ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

morphine00 commented 4 months ago

Additional comment: we noticed that the two scripts that download a sequence of TAR files also download a checksum file and run verification against it. That verification reports many errors, because the checksum file also lists associated files that were not downloaded (e.g. JSON metadata). Was this the intended behavior?

ahmadki commented 4 months ago

This is expected, although it could have been handled better.

There are two types of meta files: parquet files, which contain the original image links, and json files, which are generated by img2dataset and record information about the download process, such as the error code when a file fails to download.

The meta files were uploaded along with the dataset (you can view them here), but I didn't include them in the download scripts to save on bandwidth. Changing the validation command from:

sha512sum --quiet -c sha512sums.txt

to

grep -F '.tar' sha512sums.txt | sha512sum --quiet -c

should get rid of the warnings.
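For reference, the filtered check can be wrapped in a small helper. This is only a sketch, assuming GNU coreutils and a sha512sums.txt in the standard "<hash>  <filename>" format written by sha512sum; the function name verify_tars is made up for illustration and is not part of the download scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): verify only the .tar
# entries of a checksum file, skipping metadata files that were never
# downloaded.
verify_tars() {
    # Checksum file to read; defaults to the name used in this thread.
    local sums_file="${1:-sha512sums.txt}"
    # -F makes grep treat ".tar" as a literal string, not a regex.
    # With no file operand after -c, sha512sum reads the list from stdin.
    grep -F '.tar' "$sums_file" | sha512sum --quiet -c
}
```

Calling `verify_tars sha512sums.txt` from the download directory returns a non-zero status if any listed TAR file is missing or corrupt. GNU sha512sum also has an `--ignore-missing` option that skips entries whose files are absent, which may be an alternative if the checksum file should be used unfiltered.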

morphine00 commented 4 months ago

It's trivial to change that line. That said, we checked, and the dataset itself is many gigabytes while the support files are only a few MB each. Perhaps it would be simpler to have the download scripts grab the entire directories?

nathanw-mlc commented 3 months ago

bump