mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

BERT reference training data is no longer available #467

Closed matthew-frank closed 1 year ago

matthew-frank commented 3 years ago

Competitors are required to use the reference training data to train their submissions. For the BERT benchmark, the specified URL for the reference training data is https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2, but that is a dead link. (The site https://dumps.wikimedia.org/enwiki/ is not an archival repository; it seems to be just a rolling cache of the last 8 biweekly dumps, so it should not be relied on as a source for a specific historical resource.)

We need to make sure all MLPerf submitters have access to the same training data for each benchmark well before the deadline (which is only a month away).
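
One way to confirm that every submitter ends up with byte-identical data, wherever the dump is eventually hosted, would be to compare a checksum of the downloaded file against a published value. The following is only a minimal sketch: the helper names `sha256_of_file` and `matches_reference` are hypothetical, and no reference digest is given in this thread, so the expected value would have to be published alongside the data.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Chunked reads keep memory use flat for multi-gigabyte dumps.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, expected_hex):
    """Compare a file's digest with a published one (hypothetical value).

    The expected digest is an assumption: it is not provided in this
    issue and would need to be published with the reference data.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A submitter would call `matches_reference` on the local copy of the dump with whatever digest is published; a mismatch would indicate a different or corrupted file.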

matthew-frank commented 3 years ago

This will be addressed by https://github.com/mlcommons/training/pull/463

matthew-frank commented 3 years ago

This is a duplicate of https://github.com/mlcommons/training/issues/377, which was closed without fixing the README.

matthew-frank commented 3 years ago

Also a duplicate of https://github.com/mlcommons/training_policies/issues/403, which was also closed without fixing the README.

TheKanter commented 3 years ago

Data is located here: https://drive.google.com/drive/u/4/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT

But we must update the README with instructions on using the checkpoints and data.

johntran-nv commented 1 year ago

Closing as a dup.