mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0
1.61k stars · 555 forks

BERT. Preprocess datasets #410

Closed ulapopov closed 1 year ago

ulapopov commented 4 years ago

Following the steps in the README, I ran these commands:

```shell
cd cleanup_scripts
mkdir -p wiki
cd wiki
wget https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2    # Optionally use curl instead
bzip2 -d enwiki-20200101-pages-articles-multistream.xml.bz2
cd ..    # back to bert/cleanup_scripts
git clone https://github.com/attardi/wikiextractor.git
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml
./process_wiki.sh '<text/*/wiki_??'
python3 extract_test_set_articles.py
```
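Before debugging the extraction step, it can help to confirm that the earlier stages actually produced the expected number of shard files. A minimal sketch of such a sanity check, not part of the MLPerf scripts; the `part-*` shard naming and the demo directory are assumptions, not the scripts' real output names:

```python
# Hypothetical sanity check: count shard files in the results folder before
# running extract_test_set_articles.py, so an incomplete process_wiki.sh run
# is caught early. The "part-*" pattern is an assumed naming scheme.
import glob
import os
import tempfile

def count_shards(results_dir, pattern="part-*"):
    """Return the number of shard files matching `pattern` in `results_dir`."""
    return len(glob.glob(os.path.join(results_dir, pattern)))

# Demo with a throwaway directory standing in for the real results folder.
with tempfile.TemporaryDirectory() as d:
    for i in range(518):  # the issue reports 518 shards, numbered 0..517
        open(os.path.join(d, f"part-{i:05d}"), "w").close()
    print(count_shards(d))  # -> 518
```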

But the last script fails with an error; it can't find all the articles:

```
321 articles out of 500 found.
Traceback (most recent call last):
  File "extract_test_set_articles.py", line 75, in <module>
    assert len(test_articles) == 500, 'Not all articles were found in shards. Incomplete test set.'
AssertionError: Not all articles were found in shards. Incomplete test set.
```

My results folder contains 518 shards (numbered 0..517).
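To narrow down why only 321 of the 500 test-set articles were found, one option is to report exactly which expected titles never appear in any shard. A rough sketch of such a debugging aid; it is not from the repo, and the substring-match rule below is an assumption that may differ from the real script's matching logic:

```python
# Hypothetical debugging aid: given the expected test-set titles and the text
# of each processed shard, list the titles that appear in no shard at all.
# Substring matching is an assumed stand-in for the real matching logic.
def find_missing(expected_titles, shard_texts):
    """Return the expected titles found in none of the shard texts, sorted."""
    found = set()
    for text in shard_texts:
        for title in expected_titles:
            if title in text:
                found.add(title)
    return sorted(set(expected_titles) - found)

# Toy demo standing in for real shard contents.
shards = ["Alpha ... article body", "Gamma ... article body"]
print(find_missing(["Alpha", "Beta", "Gamma"], shards))  # -> ['Beta']
```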

wei-v-wang commented 3 years ago

Not sure if this answer comes too late, but see https://github.com/IntelAI/models/blob/bert-lamb-pretraining-tf-2.2/quickstart/language_modeling/tensorflow/bert_large/training/bfloat16/HowToGenerateBERTPretrainingDataset.txt

peladodigital commented 1 year ago

In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than two years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen it. Even better, please come to the working group meeting to discuss your issue.