mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Specific version of wikiextractor? #429

Closed sparticlesteve closed 1 year ago

sparticlesteve commented 3 years ago

I'm having numerous issues with the current version of wikiextractor when preprocessing the BERT dataset. The code is currently broken: missing imports, undefined variables, and Python version compatibility problems.

Can someone please point me to a working version (fork or git hash) of wikiextractor that they used to preprocess the Wikipedia dataset for BERT?

wei-v-wang commented 3 years ago

Hi Steve,

I used commit 16186e290d9eb0eb3a3784c6c0635a9ed7e855c3, i.e.:

git clone https://github.com/attardi/wikiextractor.git

cd wikiextractor

git checkout 16186e290d9eb0eb3a3784c6c0635a9ed7e855c3

python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml # Results are placed in cleanup_scripts/text
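
For anyone scripting the clone and checkout steps above, here is a minimal sketch in Python that also confirms the checkout landed on the intended commit. The repository URL and hash are the ones quoted in this comment; the function names `pin_commands` and `checkout_pinned` and the `dest` parameter are hypothetical and come from neither wikiextractor nor the MLPerf reference. The extraction command itself is left exactly as written above.

```python
import subprocess

# Values quoted in the comment above.
REPO_URL = "https://github.com/attardi/wikiextractor.git"
PINNED_COMMIT = "16186e290d9eb0eb3a3784c6c0635a9ed7e855c3"


def pin_commands(dest="wikiextractor", commit=PINNED_COMMIT):
    """Build the git commands that clone the repo and check out one commit."""
    return [
        ["git", "clone", REPO_URL, dest],
        # `git -C <dir>` runs the command as if started inside <dir>.
        ["git", "-C", dest, "checkout", commit],
    ]


def checkout_pinned(dest="wikiextractor", commit=PINNED_COMMIT):
    """Run the commands, then confirm HEAD is exactly the requested commit."""
    for argv in pin_commands(dest, commit):
        subprocess.run(argv, check=True)
    head = subprocess.run(
        ["git", "-C", dest, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout.strip()
    if head != commit:
        raise RuntimeError(f"expected {commit}, got {head}")
    return head
```

Calling `checkout_pinned()` needs network access and `git` on the PATH. The final check fails loudly if the working tree ends up on any other commit.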

Details can be found at: https://github.com/IntelAI/models/blob/bert-lamb-pretraining-tf-2.2/quickstart/language_modeling/tensorflow/bert_large/training/bfloat16/HowToGenerateBERTPretrainingDataset.txt

Hope this still helps.

Best wishes, -Wei

sparticlesteve commented 3 years ago

Thanks Wei. I eventually figured out from the timeline of MLPerf Training v0.7 that a version from around March was probably correct. I had some luck with e4abb4cbd019b0257824ee47c23dd163919b731b, which is equivalent to yours.
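
One way to check a claim like this, assuming "equivalent" means the two commits contain identical files, is to compare the tree objects they point to. The sketch below is only an illustration of that reading: the helper names `tree_id` and `same_tree` and the injectable `run` parameter are hypothetical, and the thread does not say how the equivalence was actually established.

```python
import subprocess


def tree_id(rev, repo=".", run=subprocess.check_output):
    """Return the ID of the tree object that a commit points to."""
    # `<rev>^{tree}` is git revision syntax for the commit's top-level tree.
    out = run(["git", "-C", repo, "rev-parse", f"{rev}^{{tree}}"], text=True)
    return out.strip()


def same_tree(rev_a, rev_b, repo=".", run=subprocess.check_output):
    """True if both commits have identical file contents (same tree ID)."""
    return tree_id(rev_a, repo, run) == tree_id(rev_b, repo, run)
```

Inside a clone of wikiextractor, `same_tree("16186e290d9eb0eb3a3784c6c0635a9ed7e855c3", "e4abb4cbd019b0257824ee47c23dd163919b731b", repo="wikiextractor")` would report whether the two hashes from this thread share a tree. Two commits can have different hashes but the same tree, for example when only commit metadata differs.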

Perhaps the reference implementation instructions could be updated to specify the version. I see there are some open issues/PRs with some BERT updates for v1.0 so I'll see if there are documentation fixes already incoming.

wei-v-wang commented 3 years ago

Nice, Steve, glad to hear you got past this issue.

peladodigital commented 1 year ago

In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than 2 years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen. Even better, please come to the working group meeting to discuss your issue.