Closed: sparticlesteve closed this 1 year ago
Hi Steve,
I used 16186e290d9eb0eb3a3784c6c0635a9ed7e855c3, i.e.:

```shell
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
git checkout 16186e290d9eb0eb3a3784c6c0635a9ed7e855c3
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml
# Results are placed in cleanup_scripts/text
```
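Once extraction finishes, the output directory contains plain-text files (conventionally `AA/wiki_00`, `AA/wiki_01`, ...) in which each article is wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` block. A minimal sketch for iterating over those articles when assembling a BERT pretraining corpus, assuming the default non-JSON output format of that wikiextractor revision:

```python
import re
from pathlib import Path

# Matches one extracted article:
# <doc id="..." url="..." title="...">\n body \n</doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<body>.*?)\n</doc>',
    re.DOTALL,
)

def iter_docs(text):
    """Yield (title, body) pairs from the contents of one output file."""
    for m in DOC_RE.finditer(text):
        yield m.group("title"), m.group("body").strip()

def iter_output_dir(root):
    """Walk an extractor output tree (e.g. text/AA/wiki_00, ...),
    yielding (title, body) pairs for every article found."""
    for path in sorted(Path(root).rglob("wiki_*")):
        yield from iter_docs(path.read_text(encoding="utf-8"))
```

This treats the output as flat text rather than XML, which is safe here because the extractor emits exactly one `<doc>` wrapper per article with no nesting.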
Details can be found at: https://github.com/IntelAI/models/blob/bert-lamb-pretraining-tf-2.2/quickstart/language_modeling/tensorflow/bert_large/training/bfloat16/HowToGenerateBERTPretrainingDataset.txt
Hope this still helps.
Best wishes, -Wei
Thanks Wei. I eventually figured out from the MLPerf Training v0.7 timeline that a version from around March 2020 was probably correct. I had some luck with e4abb4cbd019b0257824ee47c23dd163919b731b, which is equivalent to yours.
Perhaps the reference implementation instructions could be updated to specify the version. I see there are some open issues/PRs with BERT updates for v1.0, so I'll check whether documentation fixes are already incoming.
Nice, Steve. Glad to hear you got past this issue.
In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than 2 years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen it. Better yet, please come to the working group meeting to discuss your issue.
I'm having numerous issues with the current version of wikiextractor for preprocessing the BERT dataset. The code is currently broken, with missing imports, undefined variables, and Python version compatibility problems.
Can someone please point me to a working version (fork or git hash) of wikiextractor that they used to preprocess the Wikipedia dataset for BERT?