Closed liuchongming74 closed 4 years ago
I am stuck on this problem as well. Could you let me know how you solved it? Many thanks.
I guess dict.zh.txt and dict.en.txt were learned using subword-nmt (pip install subword-nmt).
Besides, given the BPE codes that the author provided, I simply use the following command to create the train.en, train.zh, valid.en, and valid.zh files with BPE encoding:
subword-nmt apply-bpe -c all.en.bpe.codes -i train.en.txt -o train.en
Here train.en.txt could be your own text file.
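The same command can be repeated for every split and language. The following is only a sketch: the function name apply_bpe_all, the input names valid.en.txt, train.zh.txt, and valid.zh.txt, and the codes file all.zh.bpe.codes are assumptions that mirror the English example above and may not match the files you actually have.

```shell
# Sketch: apply BPE codes to each split (train, valid) and language (en, zh).
# Assumed, hypothetical file names: raw text in <split>.<lang>.txt and
# codes in all.<lang>.bpe.codes, following the pattern of the command above.
apply_bpe_all() {
  runner="${1:-}"   # pass "echo" to print the commands instead of running them
  for lang in en zh; do
    for split in train valid; do
      $runner subword-nmt apply-bpe -c "all.${lang}.bpe.codes" \
        -i "${split}.${lang}.txt" -o "${split}.${lang}"
    done
  done
}

# Preview the four commands without running them:
#   apply_bpe_all echo
# Run them for real (requires: pip install subword-nmt):
#   apply_bpe_all
```

Previewing with echo first makes it easy to confirm that each input and output name is what you expect before any file is written.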
First of all, thanks for your reply. Following the BPE encoding you suggested, if I only use dict.en.txt, is the command as follows?
subword-nmt apply-bpe -c all.en.bpe.codes -i dict.en.txt -o train.en
Also, are valid.en and valid.zh created the same way, as below?
subword-nmt apply-bpe -c all.en.bpe.codes -i dict.en.txt -o valid.en
Hello everyone, thanks for open-sourcing MASS. I'm new to the NLP area, so I am somewhat confused about how to preprocess the corpus. If it's convenient, could you answer my questions?
According to the README file, the data has to be organized as shown below. Are train.en and train.zh sentences that have already been encoded with the BPE codes? Or are train.en and train.zh simply split out of the provided BPE codes file?

data/
├─ mono/
|  ├─ train.en
|  ├─ train.zh
|  ├─ valid.en
|  ├─ valid.zh
|  ├─ dict.en.txt
|  └─ dict.zh.txt
└─ para/
   ├─ train.en
   ├─ train.zh
   ├─ valid.en
   ├─ valid.zh
   ├─ dict.en.txt
   └─ dict.zh.txt
Looking forward to your reply.
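Once the files are prepared, the layout quoted from the README can be checked with a short script. This is a minimal sketch: the function name check_mass_layout and the default root directory data are assumptions for illustration, and the script only tests that each file exists, not that its contents are BPE-encoded.

```shell
# Sketch: report which files from the README layout are missing.
# Usage: check_mass_layout [root]   (root defaults to "data", an assumption)
check_mass_layout() {
  root="${1:-data}"
  missing=0
  for sub in mono para; do
    for f in train.en train.zh valid.en valid.zh dict.en.txt dict.zh.txt; do
      if [ ! -f "${root}/${sub}/${f}" ]; then
        echo "missing: ${root}/${sub}/${f}"
        missing=$((missing + 1))
      fi
    done
  done
  # Succeed only when every expected file is present.
  [ "$missing" -eq 0 ]
}
```

Calling check_mass_layout data prints one line per missing file and returns a non-zero status if anything is absent.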