How to process Chinese text for Chinese-English translation

Hi, I want to train this model on a ZH-EN dataset from http://www.statmt.org/wmt17/metrics-task.html or http://statmt.org/wmt18/translation-task.html

I downloaded the data but ran into some problems in training. First, I noticed that in the provided EN-VI example dataset the English sentences look like this: "That report was written by 620 scientists from 40 countries ."
But in the data from training-parallel-nc-v13 or wmt17-metrics-task the English sentences look like this: "And now, finally, an Olympic champion." The '.' and ',' follow the preceding word directly, without a space. Does this affect performance? Do I need to preprocess the text first and separate the punctuation from the words?

I built the vocab file by sorting words by count and adding the special tokens at the beginning. The result: I get 16,000+ words from newstest2017-enzh-ref.zh. Is that right? Do I need to reduce the vocab size?
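
To make the question concrete, here is a minimal sketch of the kind of preprocessing I mean. It assumes a simple regex tokenizer, which is probably not the tokenizer used to prepare the EN-VI data, and it assumes the vocab file should start with `<unk>`, `<s>`, `</s>`. The helper names `tokenize_en` and `build_vocab` are hypothetical, not part of this repo.

```python
import re
from collections import Counter

# Assumption: a word is a run of word characters; every other
# non-space character (punctuation) becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_en(line):
    """Return the line with punctuation separated from words by spaces."""
    return " ".join(TOKEN_RE.findall(line))


def build_vocab(lines, max_size=None, specials=("<unk>", "<s>", "</s>")):
    """Build a vocab list from already-tokenized lines.

    Words are sorted by descending count; the special tokens are
    placed first. max_size limits the number of regular words kept.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    words = [word for word, _ in counts.most_common(max_size)]
    return list(specials) + words


if __name__ == "__main__":
    print(tokenize_en("And now, finally, an Olympic champion."))
    # And now , finally , an Olympic champion .
```

Note that this crude regex also splits contractions such as "don't" into three tokens, so a dedicated tokenizer may be the better choice. Passing `max_size` is one way to cap the vocab if 16,000+ entries turns out to be too many.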

Second, how should I process the Chinese text, and how do I build a Chinese vocab? Issue #22 touches on something related, but it gives no details. How can I process the Chinese text? With NLTK?
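
One option I can think of is character-level segmentation, sketched below with only the standard library. It assumes the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) covers the Chinese characters in the data; `segment_zh_chars` is a hypothetical helper. A word-level segmenter would be the other route, which this sketch does not cover.

```python
import re

# Assumption: each Chinese character is one token, runs of ASCII
# letters or digits stay together, and any other non-space character
# (e.g. punctuation) is a token of its own.
SEG_RE = re.compile(
    r"[\u4e00-\u9fff]"                    # a single CJK ideograph
    r"|[A-Za-z0-9]+"                      # a run of ASCII letters/digits
    r"|[^\s\u4e00-\u9fffA-Za-z0-9]"       # any other non-space character
)


def segment_zh_chars(line):
    """Return the line as space-separated character-level tokens."""
    return " ".join(SEG_RE.findall(line))


if __name__ == "__main__":
    print(segment_zh_chars("我在2017年去了北京。"))
    # 我 在 2017 年 去 了 北 京 。
```

With the text segmented this way, the Chinese vocab could be built the same way as the English one, by counting the space-separated tokens.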

By the way, if the dev file and the test file are the same, does that affect the results?

Thank you very much!

tensorflow / nmt

How to process Chinese text for Chinese-English translation #435