pytorch / translate

Translate - a PyTorch Language Library
BSD 3-Clause "New" or "Revised" License

How to train a translator from English to Chinese? #278

Closed lucasjinreal closed 5 years ago

lucasjinreal commented 5 years ago

I'm wondering what the proper data preparation is for training a translation model.

jmp84 commented 5 years ago

@jinfagang, I would look into the Moses tokenizer (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl); I believe the -l zh option is supported. Otherwise, search Google or Google Scholar for Chinese segmentation.

rasoolims commented 5 years ago

@jinfagang, also take a look at UDPipe. It has a pretrained Chinese tokenizer: http://ufal.mff.cuni.cz/udpipe

I guess Stanford NLP also has Chinese segmentation models.

lucasjinreal commented 5 years ago

@jmp84 @rasoolims Thanks for replying. I know how to segment the Chinese text, but where can I find out how to prepare the dataset? I mean, after segmentation, how should I organise the data so that it can be fed into the model?

jmp84 commented 5 years ago

@jinfagang, can you check the examples? https://github.com/pytorch/translate/tree/master/pytorch_translate/examples, especially https://github.com/pytorch/translate/blob/master/pytorch_translate/examples/train_iwslt14.sh

You'll want one training source file, one training target file, one dev source file, and one dev target file. Line N of a source file and line N of the corresponding target file are translations of each other (this is called Moses format).
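A minimal sketch of that layout, in case it helps. The function name `write_parallel_files`, the file names (`train.src`, `train.tgt`, `dev.src`, `dev.tgt`), and the sample sentences are assumptions made for illustration, not anything the library requires; the sentences are assumed to be already tokenized or segmented.

```python
import os
import tempfile


def write_parallel_files(pairs, out_dir, dev_size=1):
    """Write (source_tokens, target_tokens) pairs as four line-aligned files.

    The first `dev_size` pairs go to the dev files, the rest to the
    training files. File names are hypothetical; use whatever paths
    your training script expects.
    """
    os.makedirs(out_dir, exist_ok=True)
    splits = {"dev": pairs[:dev_size], "train": pairs[dev_size:]}
    paths = {}
    for split, split_pairs in splits.items():
        src_path = os.path.join(out_dir, f"{split}.src")
        tgt_path = os.path.join(out_dir, f"{split}.tgt")
        with open(src_path, "w", encoding="utf-8") as src_f, \
                open(tgt_path, "w", encoding="utf-8") as tgt_f:
            for src_tokens, tgt_tokens in split_pairs:
                # One sentence per line, tokens joined by single spaces.
                # Writing both files in the same loop keeps line N of the
                # source aligned with line N of the target.
                src_f.write(" ".join(src_tokens) + "\n")
                tgt_f.write(" ".join(tgt_tokens) + "\n")
        paths[split] = (src_path, tgt_path)
    return paths


# Illustrative, already-segmented English-Chinese pairs (made up).
example_pairs = [
    (["hello", ",", "friends"], ["你好", ",", "朋友", "们"]),
    (["i", "like", "cats"], ["我", "喜欢", "猫"]),
    (["thank", "you"], ["谢谢"]),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_parallel_files(example_pairs, tmp_dir, dev_size=1)
    for split, (src_path, tgt_path) in written.items():
        print(split, os.path.basename(src_path), os.path.basename(tgt_path))
```

In practice the dev set would be a held-out sample rather than the first few pairs; the fixed split here just keeps the sketch short.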

lucasjinreal commented 5 years ago

@jmp84 Thanks. Should every line be segmented with a space as the separator? Should I edit the dataloader for different languages?

jmp84 commented 5 years ago

@jinfagang, yes, space is the separator. No need to edit the dataloader. Here is an example in English-Spanish. Source file contains: hello , friends

Target file contains: hola , amigos
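A quick sanity check along these lines can catch misaligned files before training. This is only a sketch: `check_parallel` is a hypothetical helper, not part of pytorch/translate, and it only checks the two properties described in this thread (equal line counts and non-empty, space-separated lines).

```python
def check_parallel(src_lines, tgt_lines):
    """Return a list of human-readable problems found in a parallel corpus."""
    problems = []
    if len(src_lines) != len(tgt_lines):
        problems.append(
            f"line count mismatch: {len(src_lines)} source vs "
            f"{len(tgt_lines)} target"
        )
    # zip stops at the shorter list, so only lines present in both are compared.
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        # str.split() with no argument splits on runs of whitespace,
        # so a blank line yields an empty token list.
        if not src.split() or not tgt.split():
            problems.append(f"line {number}: empty source or target sentence")
    return problems


source = ["hello , friends"]
target = ["hola , amigos"]
print(source[0].split())  # the tokens as a whitespace-splitting loader would see them
print(check_parallel(source, target))
```

An empty result means the two files at least line up; it says nothing about translation quality.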

lucasjinreal commented 5 years ago

@jmp84 Thank you so much! I think I understand now.

kalyangvs commented 5 years ago

@jinfagang Please share the BLEU score you obtained and the data you used (for example, CWMT + UN parallel).