mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Data download for Stable Diffusion fails #731

Closed coppock closed 4 months ago

coppock commented 7 months ago

After building the Docker image provided in stable_diffusion, the first data download command fails as follows:

root@0d839dc3dd25:/workspace# scripts/datasets/laion400m-filtered-download-moments.sh --output-dir /datasets/laion-400m/webdataset-moments-filtered
scripts/datasets/laion400m-filtered-download-moments.sh: line 18: rclone: command not found
scripts/datasets/laion400m-filtered-download-moments.sh: line 20: rclone: command not found
scripts/datasets/laion400m-filtered-download-moments.sh: line 22: rclone: command not found
sha512sum: sha512sums.txt: No such file or directory
root@0d839dc3dd25:/workspace#

amasin2111 commented 4 months ago

I'm observing the same issue.

ahmadki commented 4 months ago

The issue originated after merging https://github.com/mlcommons/training/pull/712

Previously, the dataset was downloaded directly from the MLC S3 bucket using wget; that PR changed the download method to rclone + Cloudflare. rclone is not installed in the Docker image, so I added it in https://github.com/mlcommons/training/pull/752
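For anyone hitting the same "command not found" failure in an older image, a preflight check makes the missing dependency obvious before any download step runs. This is a minimal sketch, not code from the repository's scripts; `require_cmd` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository's scripts):
# succeed if the named command is available on PATH, otherwise
# print an error to stderr and return a non-zero status.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: required command '$1' not found in PATH" >&2
    return 1
  fi
}
```

Calling `require_cmd rclone || exit 1` near the top of a download script would stop it immediately with one clear message, instead of failing on each rclone line and then on the sha512sum check.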

amasin2111 commented 4 months ago

Even if we install rclone separately and then run the script laion400m-filtered-download-images.sh, we get an error saying that the source directory doesn't exist. Specifically, the command below produces this error:

rclone copy mlc-training:mlcommons-training-wg-public/stable_diffusion/datasets/laion-400m/moments-webdataset-filtered/ ${OUTPUT_DIR} --include="*.tar" -P
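One way to narrow this down is to check whether the `mlc-training` remote is defined in the local rclone configuration at all. The sketch below assumes only that `rclone listremotes` prints one remote per line with a trailing colon; `remote_configured` is a hypothetical helper name, and this check does not by itself explain the error reported here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `rclone listremotes` prints
# a line that is exactly "<name>:" for the given remote name.
remote_configured() {
  rclone listremotes | grep -Fxq "$1:"
}
```

If `remote_configured mlc-training` succeeds, the remote exists locally and the next step would be to list the parent path (for example with `rclone lsf`) to see which directories are actually present on the remote.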

ahmadki commented 4 months ago

I just saw https://github.com/mlcommons/training/issues/751. I'll look into it and resolve the issue ASAP.