Closed matthew-frank closed 1 year ago
This will be addressed by https://github.com/mlcommons/training/pull/463
This is a duplicate of https://github.com/mlcommons/training/issues/377, which was closed without fixing the README.
Also a duplicate of https://github.com/mlcommons/training_policies/issues/403, which was also closed without fixing the README.
Data is located here: https://drive.google.com/drive/u/4/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT
But we still need to update the README with instructions on using the checkpoints and data.
Closing as a dup.
Competitors are required to use the reference training data to train their submissions. For the BERT benchmark, the specified URL for the reference training data is https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2, but that is a dead link. (The site https://dumps.wikimedia.org/enwiki/ is not an archival repository; it seems to be just a rolling cache of the last 8 biweekly dumps, so it should not be relied on as a source for a specific historical dump.)
We need to make sure all MLPerf submitters have access to the same training data for each benchmark well before the deadline (which is only a month away).
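Once a stable copy of the data is hosted, one way for submitters to confirm they all have the same file would be to compare a checksum. The following is only a minimal sketch, assuming a SHA-256 digest is published alongside the data; no such digest appears in this thread, and the function names `sha256_of_file` and `matches_reference` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks.

    Chunked reading keeps memory use flat even for a multi-gigabyte dump.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, expected_hex):
    """Compare a local file against a published digest (assumed to exist)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A submitter would call `matches_reference` on the downloaded archive with whatever digest the README ends up publishing; a mismatch would mean the local copy is not the reference data.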