I'm trying to run your code. How do I set up the datasets and load the dictionary? I don't have dict.txt.

guolinke commented 4 years ago

The dict is in https://guolinke.blob.core.windows.net/tupe/tupe_ckp.tar.gz

wymxz commented 4 years ago

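Thanks. I don't quite understand the overall pretraining data pipeline described in the docs. What processing does preprocess/pretrain/process.sh do? I used WikiExtractor.py to convert enwiki-latest-pages-articles.xml.bz2 into txt files. What is the next step? Do I run process.sh?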

guolinke commented 4 years ago

Yeah, please follow the script.

wymxz commented 4 years ago

I get an error when I run process.sh and I don't know how to fix it. Here is the output:

calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:60:18: error: expected unqualified-id before ‘[’ token
       const auto [it, ins] = wordfreq.insert({word, 1});
                  ^
calc_wordfreq.cpp:60:55: error: expected primary-expression before ‘)’ token
       const auto [it, ins] = wordfreq.insert({word, 1});
                                                       ^
calc_wordfreq.cpp:61:12: error: ‘ins’ was not declared in this scope
       if (!ins) {
            ^
calc_wordfreq.cpp:62:11: error: ‘it’ was not declared in this scope
         ++it->second;
           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:20: error: expected unqualified-id before ‘[’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                    ^
calc_wordfreq.cpp:82:20: error: expected ‘;’ before ‘[’ token
calc_wordfreq.cpp:82:21: error: ‘k’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                     ^
calc_wordfreq.cpp:82:24: error: ‘v’ was not declared in this scope
   for (const auto &[k, v] : sorted_unigram_freq) {
                        ^
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:82:27: error: expected ‘{’ before ‘:’ token
   for (const auto &[k, v] : sorted_unigram_freq) {
                           ^
calc_wordfreq.cpp: In function ‘int main(int, char**)’:
calc_wordfreq.cpp:82:27: error: expected ‘;’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected ‘)’ before ‘:’ token
calc_wordfreq.cpp:82:27: error: expected primary-expression before ‘:’ token

I then edited calc_wordfreq.cpp twice, renaming the variables on line 60, and reran process.sh each time. The errors for line 82 were identical to the ones above; only the errors for lines 60 to 62 changed:

(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ ls
calc_wordfreq.cpp            process.sh           split.py
concat_short_sentences.py    replace_patterns.py  tupedata
filter_and_cleanup_lines.py  segment_sentence.py  WikiExtractor.py
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq.cpp
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ bash process.sh
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:60:18: error: expected unqualified-id before ‘[’ token
       const auto [zit, ins] = wordfreq.insert({word, 1});
                  ^
calc_wordfreq.cpp:60:56: error: expected primary-expression before ‘)’ token
       const auto [zit, ins] = wordfreq.insert({word, 1});
                                                        ^
calc_wordfreq.cpp:61:12: error: ‘ins’ was not declared in this scope
       if (!ins) {
            ^
calc_wordfreq.cpp:62:11: error: ‘zit’ was not declared in this scope
         ++zit->second;
           ^
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq.cpp
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ bash process.sh
calc_wordfreq.cpp: In lambda function:
calc_wordfreq.cpp:60:18: error: expected unqualified-id before ‘[’ token
       const auto [zit, zins] = wordfreq.insert({word, 1});
                  ^
calc_wordfreq.cpp:60:57: error: expected primary-expression before ‘)’ token
       const auto [zit, zins] = wordfreq.insert({word, 1});
                                                         ^
calc_wordfreq.cpp:61:12: error: ‘zins’ was not declared in this scope
       if (!zins) {
            ^
calc_wordfreq.cpp:62:11: error: ‘zit’ was not declared in this scope
         ++zit->second;
           ^
(pytorch1.5) zqw@DGX:~/TUPE/preprocess/pretrain$ vim calc_wordfreq.cpp

guolinke commented 4 years ago

Please make sure you have a C++ compiler that can build calc_wordfreq.cpp. Installing the latest g++, which supports C++17, should fix this.

https://github.com/guolinke/TUPE/blob/f8282b293235284be4d07469cad7b078fa60e95f/preprocess/pretrain/process.sh#L22

wymxz commented 4 years ago

Thanks for the answer, but I can't find the book_corpus training data. Could you send it to me?

guolinke commented 4 years ago

refer to https://github.com/guolinke/TUPE/issues/3#issuecomment-667992663 .

wymxz commented 4 years ago

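I have finally finished the data preprocessing. Thank you for your answers and help. With your parameters, how long would training take on four 16 GB V100s?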

guolinke commented 4 years ago

@wymxz It will probably take quite a while. On my side, with eight 16 GB V100s (with NVLink), it takes about six or seven days.

wyu97 commented 3 years ago

Hi Guolin,

Have you ever run into this issue when running process.sh?

It comes from line 72 of process.sh:

./fastbpe applybpe \
  $DATA_DIR/corpus.train.tok.bpe \
  $DATA_DIR/corpus.train.tok.tmp \
  $DATA_DIR/bpe-code

guolinke commented 3 years ago

@wyu97 Not really. It looks like something is wrong with the data. You could also try the BERT tokenizer from Hugging Face, which is much easier to use.

wyu97 commented 3 years ago

@guolinke So the difference between the .bpe and .tmp files is whether the words have been tokenized, right? In other words, the lines themselves are unchanged and only the words are split into subwords. Since I cannot run the code yet (it still raises the error, which I have not fixed), I do not know what the .bpe file looks like.

guolinke / TUPE

I'm trying to run your code. How do I set up the datasets and load the dictionary? I don't have dict.txt #2