qpjaada closed this issue 1 year ago.
@sgpyc what do you think?
In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than 2 years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen it. Even better, please come to the working group meeting to discuss your issue.
Currently, the Wikipedia dataset is hosted at an MLCommons Google Drive location. One has to run the `create_pretraining_data` script with a duplication factor of 10, which generates a large number of TFRecord files (~365 GB), yet only ~2% of this generated dataset is eventually used for benchmarking. This has the potential to create variance across submitters depending on which subset of the data the model is trained on (https://github.com/mlcommons/training/tree/master/language_model/tensorflow/bert#generate-the-tfrecords-for-wiki-dataset).

The request here is to create a much smaller pretraining dataset by running the `create_pretraining_data` script with a duplication factor of 1, and to host this smaller dataset at a convenient location. This would standardize the dataset and considerably reduce the burden on any new submitter.
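For reference, a minimal sketch of the proposed invocation, assuming the flag names of the upstream BERT `create_pretraining_data.py` script that the reference implementation uses; all paths and shard names are hypothetical placeholders:

```bash
# Generate the pretraining TFRecords with no duplication.
# <wiki_dir>, <tfrecord_dir>, and <vocab_dir> are placeholder paths.
python3 create_pretraining_data.py \
  --input_file=<wiki_dir>/part-00000-of-00500 \
  --output_file=<tfrecord_dir>/part-00000-of-00500 \
  --vocab_file=<vocab_dir>/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=1   # instead of the current dupe_factor=10
```

With `--dupe_factor=1`, each input shard is masked only once, so the output should shrink roughly tenfold, and every submitter would train on the same fixed set of masked examples.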