tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

How to process Chinese text in Chinese-English translation #435

Closed kaidi-jin closed 5 years ago

kaidi-jin commented 5 years ago

Hi, I want to train this model on a ZH-EN dataset from http://www.statmt.org/wmt17/metrics-task.html or http://statmt.org/wmt18/translation-task.html

I downloaded the data but ran into some problems during training. First, I noticed that in the provided EN-VI example dataset the English sentences look like: "That report was written by 620 scientists from 40 countries ."
But in the data from training-parallel-nc-v13 or wmt17-metrics-task the English sentences look like: "And now, finally, an Olympic champion." The '.' and ',' follow the preceding word directly, without a space. Does this affect performance? Do I need to preprocess the text first and separate the punctuation from the words? I built the vocab file by sorting the words by count and adding the special tokens at the beginning (the result was shown in an attached image). I get 16000+ words in newstest2017-enzh-ref.zh. Is that right? Do I need to reduce the vocab size?
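To make the punctuation and vocab questions concrete, here is a minimal sketch of the preprocessing described above: split punctuation from words, count tokens, and write the most frequent ones after the special tokens. The function names `tokenize_en` and `build_vocab` are hypothetical, the regex is a simple stand-in for a dedicated tokenizer, and the `<unk>`, `<s>`, `</s>` header is an assumption about what the vocab file should start with.

```python
import re
from collections import Counter

# Assumption: the special tokens placed at the top of the vocab file.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def tokenize_en(line):
    """Split punctuation from words, e.g. 'champion.' -> ['champion', '.']."""
    # Words (with inner apostrophes kept, as in "don't") or any single
    # non-space, non-word character as its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", line)


def build_vocab(tokenized_lines, max_size=None):
    """Return special tokens followed by words sorted by descending count."""
    counts = Counter(tok for toks in tokenized_lines for tok in toks)
    # Ties are broken alphabetically so the output is deterministic.
    words = [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    if max_size is not None:
        # max_size is the total vocab size, including the special tokens.
        words = words[: max(0, max_size - len(SPECIAL_TOKENS))]
    return SPECIAL_TOKENS + words


if __name__ == "__main__":
    tokens = tokenize_en("And now, finally, an Olympic champion.")
    print(" ".join(tokens))
    print("\n".join(build_vocab([tokens])))
```

Passing `max_size` caps the vocabulary, which is one way to reduce the vocab size if the full word list turns out to be too large.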

Second, how should I process the Chinese text, and how do I build the Chinese vocab? Issue #22 discusses something related, but it gives no details. How can I process the Chinese text? With NLTK?

By the way, does it affect the result if the dev file and the test file are the same?

Thank you very much!

kaidi-jin commented 5 years ago

I processed the Chinese text with jieba (https://github.com/fxsjy/jieba) and trained the model with wmt16.json, but the results are not very good. I am going to use a larger training set and try other hparams.
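For anyone trying the same thing, the segmentation step can be sketched roughly as follows. The helper `segment_zh` is hypothetical, not part of the nmt repo. It takes any function that maps a string to a list of tokens, and falls back to character-level splitting when none is given, so the sketch runs without extra dependencies.

```python
def segment_zh(line, cut=None):
    """Return one Chinese sentence as space-separated tokens.

    `cut` is any callable mapping a string to a list of tokens.
    Assumption: with no `cut`, fall back to one token per character,
    which also splits any Latin words into single letters.
    """
    if cut is None:
        cut = list  # character-level fallback
    tokens = (tok.strip() for tok in cut(line.strip()))
    # Drop tokens that were pure whitespace.
    return " ".join(tok for tok in tokens if tok)


if __name__ == "__main__":
    print(segment_zh("我爱北京。"))
```

With jieba installed, word-level segmentation should be a matter of `import jieba` and then `segment_zh(line, cut=jieba.lcut)`. The segmented lines can then be counted to build the Chinese vocab file in the same way as the English one.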