ltechkorea / training_results_v1.1-pre


[bert] dataset download, preprocess #12

Closed dc0953 closed 3 years ago

dc0953 commented 3 years ago

Download the preprocessed dataset

https://drive.google.com/drive/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v?usp=sharing
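If you want to fetch the folder from the command line instead of the browser, one option is the gdown utility, which can download shared Google Drive folders. This is only a convenience sketch and not part of the original instructions; it assumes the link above is publicly readable:

pip install gdown
gdown --folder "https://drive.google.com/drive/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v?usp=sharing"

After the download completes, continue with the extraction and checksum verification below.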

tar xf results_text.tar.gz                              # unpack the pre-processed text shards
cd results4
md5sum --check ../bert_reference_results_text_md5.txt   # verify checksums of the extracted files
cd ..

Description of how the results_text.tar.gz file was prepared

  1. First download the wikipedia dump and extract the pages. The wikipedia dump can be downloaded from this Google Drive, and should contain enwiki-20200101-pages-articles-multistream.xml.bz2 as well as the md5sum.

  2. Run WikiExtractor.py, version e4abb4cb from March 29, 2020, to extract the wiki pages from the XML. The generated wiki page files are stored under the text output directory as LL/wiki_nn; for example AA/wiki_00. Each file is ~1 MB, and each subdirectory holds 100 files, from wiki_00 to wiki_99, except the last subdirectory. For the 20200101 dump, the last file is FE/wiki_17.

  3. Clean-up and dataset separation. The clean-up scripts (some references here) are in the scripts directory. The following command runs the clean-up steps and puts the resulting training and eval data in ./results: ./process_wiki.sh 'text/*/wiki_??'

  4. After running the process_wiki.sh script on the 20200101 wiki dump, there will be 500 files named part-00xxx-of-00500 in the ./results directory, together with eval.md5 and eval.txt (see the sanity-check sketch after the exact steps below).

  5. Exact steps (starting in the bert path)

cd input_preprocessing
mkdir -p wiki
cd wiki
# download enwiki-20200101-pages-articles-multistream.xml.bz2 from Google drive and check md5sum
bzip2 -d enwiki-20200101-pages-articles-multistream.xml.bz2
cd ..    # back to bert/input_preprocessing
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
git checkout e4abb4cbd
cd ..    # back to bert/input_preprocessing so the wiki/ and text/ paths resolve
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml    # Results are placed in bert/input_preprocessing/text
./process_wiki.sh './text/*/wiki_??'
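As a quick sanity check on steps 2 and 4, the layout of the extracted pages and the cleaned results can be inspected with a few commands. This is an illustrative sketch, not part of the original procedure; it assumes the steps above were run from bert/input_preprocessing:

# run from bert/input_preprocessing after the steps above
ls text | tail -n 1                      # last extractor subdirectory; FE for the 20200101 dump
ls text/FE | tail -n 1                   # last extracted file; wiki_17 for the 20200101 dump
ls results/part-*-of-00500 | wc -l       # expect 500 training shards
ls results/eval.md5 results/eval.txt     # eval files produced by process_wiki.sh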