We ran into difficulty with this at NVIDIA, and it does affect MLM accuracy. The solution we believe is correct is to use the first 10k samples created by the reference preprocessing pipeline, which produces 500 shards. The reference code sequentially loads the first 10k samples of the dataset (code link). While the reference is not explicit about dataset ordering, the implication is that the ordering should match the way the reference handles the code and data. We implemented this by running preprocessing exactly as the reference does and treating the first shard as a separate dataset, since we use a different number of shards for training on the same data.
The process appears to be deterministic, so the text file "part-00000-of-00500" can be used to verify that the correct 10k samples are being used for eval.
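One lightweight way to compare that shard across independent runs is to hash it and compare digests; since the pipeline appears deterministic, matching hashes mean matching eval text. This is only a sketch, not part of the reference scripts, and the path assumes the text_shards/ layout used in the create_pretraining_data.py command below:
# Sketch only: hash the first text shard so two parties can compare digests.
# The path is an assumption based on the command later in this post.
import hashlib

h = hashlib.sha256()
with open("text_shards/part-00000-of-00500", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())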
The preprocessing steps are here, but proceed with caution since there is a lingering error in the documentation around the withheld test set. Google and NVIDIA verified convergence without creating a withheld set, so please follow the instructions in the README.md but omit the following step: python3 extract_test_set_articles.py
The final step is to use create_pretraining_data.py to create the tfrecords (or update it to output a file format compatible with your implementation).
To summarize the correct instructions (consistent with how the reference was developed and tested) in one place:
cd cleanup_scripts
mkdir -p wiki
cd wiki
wget https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2 # Optionally use curl instead
bzip2 -d enwiki-20200101-pages-articles-multistream.xml.bz2
cd .. # back to bert/cleanup_scripts
git clone https://github.com/attardi/wikiextractor.git
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml # Results are placed in bert/cleanup_scripts/text
./process_wiki.sh 'text/*/wiki_??'
python3 create_pretraining_data.py \
--input_file=text_shards/part-00000-of-00500 \
--output_file=tfrecords/part-00000-of-00500 \
--vocab_file=<path to vocab.txt> \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=10
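As a quick sanity check on the last step, a minimal sketch along the following lines (assuming TensorFlow 2.x and the output path used above) counts the serialized examples in the eval shard and lists their feature keys; the reference then reads the first 10k of these examples for eval:
# Sketch only: inspect the generated eval shard (path from the command above).
# Assumes TensorFlow 2.x; the feature keys are whatever create_pretraining_data.py wrote.
import tensorflow as tf

path = "tfrecords/part-00000-of-00500"
count = 0
first_keys = None
for raw in tf.data.TFRecordDataset(path):
    if first_keys is None:
        example = tf.train.Example()
        example.ParseFromString(raw.numpy())
        first_keys = sorted(example.features.feature.keys())
    count += 1

print("records:", count)
print("feature keys:", first_keys)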
Please post questions here if anything is still unclear.
If MLPerf V0.7 submitters would like to cross check their validation set with ours, we can provide a comparison link directly to them. Please email a request.
SWG:
Creator says we are able to close this issue.
Some submitters are worried they have the wrong test dataset. How can we validate this?