Open wymxz opened 4 years ago
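Thanks. I don't quite understand the overall pretraining data pipeline described in the docs. What processing does preprocess/pretrain/process.sh do? I used WikiExtractor.py to convert enwiki-latest-pages-articles.xml.bz2 into txt files. What is the next step? Is it to run process.sh?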
Yeah, please follow the script.
Running process.sh fails with the errors below, and I don't know how to fix them.
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:60:18: error: expected unqualified-id before ‘[’ token
       const auto [it, ins] = wordfreq.insert({word, 1});
                  ^
calc_wordfreq.cpp:60:55: error: expected primary-expression before ‘)’ token
       const auto [it, ins] = wordfreq.insert({word, 1});
                                                       ^
calc_wordfreq.cpp:61:12: error: ‘ins’ was not declared in this scope
       if (!ins) {
            ^
calc_wordfreq.cpp:62:11: error: ‘it’ was not declared in this scope
         ++it->second;
           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:20: error: expected unqualified-id before ‘[’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                    ^
calc_wordfreq.cpp:82:20: error: expected ‘;’ before ‘[’ token
calc_wordfreq.cpp:82:21: error: ‘k’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                     ^
calc_wordfreq.cpp:82:24: error: ‘v’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                        ^
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:82:27: error: expected ‘{’ before ‘:’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:27: error: expected ‘;’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected ‘)’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ ls
calc_wordfreq.cpp            process.sh           split.py
concat_short_sentences.py    replace_patterns.py  tupedata
filter_and_cleanup_lines.py  segment_sentence.py  WikiExtractor.py
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq.cpp
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ bash process.sh
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:60:18: error: expected unqualified-id before ‘[’ token
       const auto [zit, ins] = wordfreq.insert({word, 1});
                  ^
calc_wordfreq.cpp:60:56: error: expected primary-expression before ‘)’ token
       const auto [zit, ins] = wordfreq.insert({word, 1});
                                                        ^
calc_wordfreq.cpp:61:12: error: ‘ins’ was not declared in this scope
       if (!ins) {
            ^
calc_wordfreq.cpp:62:11: error: ‘zit’ was not declared in this scope
         ++zit->second;
           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:20: error: expected unqualified-id before ‘[’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                    ^
calc_wordfreq.cpp:82:20: error: expected ‘;’ before ‘[’ token
calc_wordfreq.cpp:82:21: error: ‘k’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                     ^
calc_wordfreq.cpp:82:24: error: ‘v’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                        ^
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:82:27: error: expected ‘{’ before ‘:’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:27: error: expected ‘;’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected ‘)’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq.cpp
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ bash process.sh
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:60:18: error: expected unqualified-id before ‘[’ token
       const auto [zit, zins] = wordfreq.insert({word, 1});
                  ^
calc_wordfreq.cpp:60:57: error: expected primary-expression before ‘)’ token
       const auto [zit, zins] = wordfreq.insert({word, 1});
                                                         ^
calc_wordfreq.cpp:61:12: error: ‘zins’ was not declared in this scope
       if (!zins) {
            ^
calc_wordfreq.cpp:62:11: error: ‘zit’ was not declared in this scope
         ++zit->second;
           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:20: error: expected unqualified-id before ‘[’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                    ^
calc_wordfreq.cpp:82:20: error: expected ‘;’ before ‘[’ token
calc_wordfreq.cpp:82:21: error: ‘k’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                     ^
calc_wordfreq.cpp:82:24: error: ‘v’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                        ^
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:82:27: error: expected ‘{’ before ‘:’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:27: error: expected ‘;’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected ‘)’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq.cpp
Please make sure you have a C++ compiler that can compile calc_wordfreq.cpp. You can install the latest g++, which supports C++17.
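For context, the two failing lines in the log (`const auto [it, ins] = ...` and `for (const auto &[k, v] : ...)`) use structured bindings, a C++17 feature, which is why an older g++ rejects them no matter how the variables are named. If upgrading the compiler is not an option, the same logic can be written in C++11 style. The following is only a minimal sketch of that idea, not the repository's calc_wordfreq.cpp: the function names `count_words`, `sort_by_freq`, and `format_freq` are hypothetical, and the real file may differ in types and structure.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not from the repository): count how often each word occurs.
std::unordered_map<std::string, long> count_words(const std::vector<std::string>& words) {
    std::unordered_map<std::string, long> wordfreq;
    for (const std::string& word : words) {
        // C++17 form from the error log:
        //   const auto [it, ins] = wordfreq.insert({word, 1});
        // C++11 equivalent: insert() returns std::pair<iterator, bool>.
        const auto result = wordfreq.insert({word, 1});
        if (!result.second) {         // the word was already in the map
            ++result.first->second;   // so bump its existing count
        }
    }
    return wordfreq;
}

// Hypothetical helper: sort by descending frequency, ties broken alphabetically.
std::vector<std::pair<std::string, long>> sort_by_freq(
        const std::unordered_map<std::string, long>& wordfreq) {
    std::vector<std::pair<std::string, long>> sorted(wordfreq.begin(), wordfreq.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, long>& a,
                 const std::pair<std::string, long>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return sorted;
}

// Hypothetical helper: render one "word count" pair per line.
std::string format_freq(const std::vector<std::pair<std::string, long>>& sorted) {
    std::ostringstream out;
    // C++17 form from the error log:
    //   for (const auto &[k, v] : sorted_unigram_freq) { ... }
    // C++11 equivalent: name the pair and use .first / .second.
    for (const auto& kv : sorted) {
        out << kv.first << ' ' << kv.second << '\n';
    }
    return out.str();
}
```

The alternative is to leave the source untouched and build it with a compiler that implements structured bindings (GCC 7 or later) using `-std=c++17`. Whether process.sh already passes that flag is not shown in this thread.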
Thank you for the answer, but I cannot find the book_corpus training data. Could you send it to me?
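I have finally finished the data preprocessing. Thank you for your answers and help. With your parameters, how long would training take on four 16 GB V100s?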
@wymxz Probably quite a long time. On my side, with eight 16 GB V100s (with NVLink), it takes about six to seven days.
Hi Guolin,
Have you ever run into this issue when running process.sh? It comes from line 72 of process.sh:
./fastbpe applybpe \
$DATA_DIR/corpus.train.tok.bpe \
$DATA_DIR/corpus.train.tok.tmp \
$DATA_DIR/bpe-code
@wyu97 Not really; it seems there is something wrong in the data. You can also try the BERT tokenizer from Hugging Face, which is much easier to use.
@guolinke So the difference between the .bpe and .tmp files is whether the words in the file are tokenized, right? In other words, the lines themselves are not changed; only the words are split into subwords. Since I cannot run the code now (it still raises the error and I have not fixed it yet), I do not know what the .bpe file looks like.
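To illustrate the question above, here is a toy sketch of what applying BPE to a tokenized line generally does: each input line yields exactly one output line, and only the words inside it are split into subword pieces. This is not fastBPE's implementation. The names `make_ranks`, `bpe_word`, and `apply_bpe_line` are hypothetical, the merge list stands in for a BPE code file, the `@@` continuation marker is assumed because it is the convention commonly used by BPE tools, and details such as end-of-word markers and UTF-8 handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Merge = std::pair<std::string, std::string>;
using MergeRanks = std::map<Merge, std::size_t>;

// Hypothetical helper: earlier merges in the list get higher priority (lower rank).
MergeRanks make_ranks(const std::vector<Merge>& merges) {
    MergeRanks ranks;
    for (std::size_t i = 0; i < merges.size(); ++i) {
        ranks.insert(std::make_pair(merges[i], i));
    }
    return ranks;
}

// Split one word into subword units: start from single characters (ASCII assumed)
// and repeatedly merge the adjacent pair with the lowest rank until none applies.
std::vector<std::string> bpe_word(const std::string& word, const MergeRanks& ranks) {
    const std::size_t kNone = std::numeric_limits<std::size_t>::max();
    std::vector<std::string> symbols;
    for (char c : word) {
        symbols.push_back(std::string(1, c));
    }
    while (symbols.size() > 1) {
        std::size_t best_rank = kNone;
        std::size_t best_pos = 0;
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
            const auto it = ranks.find(std::make_pair(symbols[i], symbols[i + 1]));
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_pos = i;
            }
        }
        if (best_rank == kNone) break;  // no known merge is left for this word
        symbols[best_pos] += symbols[best_pos + 1];
        symbols.erase(symbols.begin() + static_cast<std::ptrdiff_t>(best_pos) + 1);
    }
    return symbols;
}

// Apply BPE to one whitespace-tokenized line; the result is still a single line.
std::string apply_bpe_line(const std::string& line, const MergeRanks& ranks) {
    std::istringstream in(line);
    std::string word;
    std::string out;
    while (in >> word) {
        const std::vector<std::string> pieces = bpe_word(word, ranks);
        for (std::size_t i = 0; i < pieces.size(); ++i) {
            if (!out.empty()) out += ' ';
            out += pieces[i];
            if (i + 1 < pieces.size()) out += "@@";  // marks a non-final piece
        }
    }
    return out;
}
```

With the merges (l, o), (lo, w), (e, r) in that order, the line `low lower` becomes `low low@@ er`: the line count is unchanged and only the second word is split.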
The dict is in https://guolinke.blob.core.windows.net/tupe/tupe_ckp.tar.gz