mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Switch dataset locations from Google Drive to MLCommons Cloud #680

Closed: nathanw-mlc closed this 4 months ago

nathanw-mlc commented 10 months ago

Some datasets residing on Google Drive have moved to MLCommons' Cloud storage solution. This PR updates the instructions for acquiring the datasets to link to the MLCommons Cloud storage location.

github-actions[bot] commented 10 months ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

nv-rborkar commented 9 months ago

@sgpyc tagging reference owner for review. Thanks Nathan!

arjunsuresh commented 9 months ago

Hi @nathanw-mlc, can you please tell us whether there is a way to run a checksum check while downloading files from MLCommons Cloud? We get errors like this from incomplete downloads, and they are not being flagged.
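For illustration, the kind of check being asked about could look like the sketch below. It assumes a SHA-256 digest is published alongside each file, which this thread does not confirm. The function names `sha256_of_file` and `verify_download` are hypothetical and not part of any MLCommons tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True only if the file exists and its digest matches.

    `expected_sha256` is assumed to come from a trusted source, such as a
    digest published next to the download link (an assumption here).
    """
    path = Path(path)
    if not path.is_file():
        return False
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A truncated download would then fail `verify_download` instead of passing silently, provided the reference digest itself is stable.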

WarrenSchultz commented 9 months ago

I'm also getting failed downloads for DLRMv2. Other downloads via MLCommons cloud have extremely erratic download rates.

nathanw-mlc commented 9 months ago

Has this been an ongoing problem? There was some server maintenance yesterday that caused unexpected interruptions.

WarrenSchultz commented 9 months ago

For the past week at least, I think?

arjunsuresh commented 9 months ago

@nathanw-mlc For us, the concern is that we cannot validate the downloaded file because the checksum is not constant across repeated downloads. We first noticed this with the gptj-6B download about 2 months ago: the checksums differed, but the downloaded zip file extracted successfully and worked as expected.
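One way to investigate a case like this is to compare the contents of two downloads rather than the archive bytes. The sketch below uses Python's standard `zipfile` module and assumes that the archives differ only in metadata (timestamps, for example) and not in payload, which has not been established in this thread. The helper names are hypothetical.

```python
import zipfile


def zip_content_signature(path):
    """Map each member name to (uncompressed size, CRC-32) from the zip directory."""
    with zipfile.ZipFile(path) as archive:
        return {
            info.filename: (info.file_size, info.CRC)
            for info in archive.infolist()
        }


def first_corrupt_member(path):
    """Return the name of the first member that fails its CRC check, or None."""
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and verifies its stored CRC-32.
        return archive.testzip()
```

If two downloads give equal `zip_content_signature` results and `first_corrupt_member` returns `None` for both, the extracted data is very likely the same even though the whole-file checksums differ. CRC-32 is not a cryptographic hash, so this is a diagnostic aid and not a substitute for a stable published digest.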

nathanw-mlc commented 9 months ago

Hmm, that's very strange. Thanks for bringing this to my attention. I'm looking into it.

arjunsuresh commented 9 months ago

Thank you, @nathanw-mlc. This is the relevant issue.