mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

[BERT] Standardize and simplify pretraining dataset for BERT #497

Closed: qpjaada closed this issue 1 year ago

qpjaada commented 3 years ago

Currently, the Wikipedia dataset is hosted at an MLCommons Google Drive location. One has to run the create_pretraining_data script with a duplication factor of 10, which generates a large number of TFRecord files (~365 GB), yet only ~2% of this generated dataset is actually used for benchmarking. This creates the potential for variance across submitters depending on which subset of the data the model is trained on. (https://github.com/mlcommons/training/tree/master/language_model/tensorflow/bert#generate-the-tfrecords-for-wiki-dataset)

The request here is to create a much smaller pretraining dataset by running the create_pretraining_data script with a duplication factor of 1, and to host this smaller dataset at a convenient location. This would standardize the dataset and considerably reduce the burden on any new submitter.
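For illustration, a minimal sketch of the kind of invocation the request implies, following the command shown in the linked README; the input/output paths are placeholders, the other flag values should be checked against the README, and the only intended change is --dupe_factor=1 instead of 10:

```shell
# Hypothetical invocation: same flags as the linked README, but with
# --dupe_factor=1 so each input sentence is masked only once, shrinking
# the generated TFRecords roughly 10x. Paths below are placeholders.
python3 create_pretraining_data.py \
  --input_file=<path to processed wiki text shard> \
  --output_file=<output path for the tfrecord shard> \
  --vocab_file=<path to vocab.txt> \
  --do_lower_case=True \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=1
```

The resulting, much smaller set of TFRecords could then be hosted directly, so every submitter trains on the same data rather than on a locally generated subset.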

johntran-nv commented 1 year ago

@sgpyc what do you think?

peladodigital commented 1 year ago

In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than two years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen it. Even better, please come to the working group meeting to discuss your issue.