mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Steps for language model #638

Closed mahmoodn closed 1 year ago

mahmoodn commented 1 year ago

Hi, the README for the language model is confusing. I followed the steps in dataset.md and everything seems to have worked:

git clone https://github.com/sgpyc/training
cd training/language_model/tensorflow/bert/cleanup_scripts
source download_and_uncompress.sh
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
git checkout 3162bb6c3c9ebd2d15be507aa11d6fa818a454ac
cd .. 
python wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml
./process_wiki.sh './text/*/wiki_??'
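
Before moving on, a quick sanity check on the extraction output can help. The sketch below is not part of dataset.md; `count_extracted` is a hypothetical helper name, and it assumes WikiExtractor wrote its output under `./text`, which is what the glob passed to `process_wiki.sh` implies.

```shell
# Hypothetical helper (not from the repository): count the files that the
# glob './text/*/wiki_??' would match, i.e. files named wiki_XX located
# exactly one directory level below the given root.
count_extracted() {
  # $1: root directory of the WikiExtractor output (assumed to be ./text)
  find "$1" -mindepth 2 -maxdepth 2 -type f -name 'wiki_??' 2>/dev/null \
    | wc -l | tr -d ' '
}
```

Running `count_extracted ./text` from the cleanup_scripts directory prints the number of extracted files; a result of 0 would suggest the extraction step did not produce the files that `process_wiki.sh` expects.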

So I think the dataset preparation is done. But when I look at readme.md, I can't tell which step I should continue from.

Should I continue from the "Generate the TFRecords for Wiki dataset" section?