microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

How to prepare data pipeline and utilize the provided BPE codes? #135

Closed liuchongming74 closed 4 years ago

liuchongming74 commented 4 years ago

Hello fellows, thanks for open-sourcing MASS. I'm new to the NLP area, so I have some confusion about preprocessing the corpus. If it's convenient, could you answer my questions?

According to the README file, the data has to be organized as below. Are train.en and train.zh sentences that have already been encoded with the BPE codes?

Or are train.en and train.zh just split from the provided BPE codes file?
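
For context, subword-nmt's apply-bpe rewrites each tokenized line with @@ continuation markers at subword boundaries, so a BPE-encoded train.en would contain lines like the following (made-up sentence; the actual splits depend on the learned codes):

# tokenized input line (hypothetical)
the researchers pretrained the model .
# same line after apply-bpe with the English codes
the research@@ ers pre@@ trained the model .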

hanktseng131415go commented 4 years ago

I am stuck on this problem as well. Can you let me know how you solved it? Many thanks.

liuchongming74 commented 4 years ago

> I am stuck on this problem as well. Can you let me know how you solved it? Many thanks.

I guess dict.zh.txt and dict.en.txt were learned with subword-nmt (pip install subword-nmt). Besides, using the BPE codes that the author provided, I simply ran the following command to create the BPE-encoded train.en, train.zh, valid.en, and valid.zh files:

subword-nmt apply-bpe -c all.en.bpe.codes -i train.en.txt -o train.en

train.en.txt could be your own text file.
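
For the remaining files, a minimal sketch under the same assumption (all.zh.bpe.codes is the presumed name of the provided Chinese codes file, and each .txt input is your own raw tokenized corpus):

subword-nmt apply-bpe -c all.en.bpe.codes -i valid.en.txt -o valid.en
subword-nmt apply-bpe -c all.zh.bpe.codes -i train.zh.txt -o train.zh
subword-nmt apply-bpe -c all.zh.bpe.codes -i valid.zh.txt -o valid.zh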

hanktseng131415go commented 4 years ago

Thanks for your reply, first of all. Based on your suggested BPE encoding, if I just use dict.en.txt, is the command as follows?

subword-nmt apply-bpe -c all.en.bpe.codes -i dict.en.txt -o train.en

Also, do valid.en and valid.zh follow the same pattern, as below?

subword-nmt apply-bpe -c all.en.bpe.codes -i dict.en.txt -o valid.en
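
One caution, following the earlier reply: dict.en.txt looks like a vocabulary file, while apply-bpe expects raw tokenized text as its -i input, so the input here should presumably be your own validation text rather than the dictionary. As a generic sanity check (not MASS-specific), you can strip subword-nmt's @@ markers from the output and confirm you recover the tokenized input, assuming valid.en was produced from a hypothetical raw valid.en.txt as sketched above:

# undo BPE segmentation, then compare against the raw tokenized input
sed -r 's/(@@ )|(@@ ?$)//g' valid.en > valid.en.restored
diff valid.en.restored valid.en.txt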