I downloaded the data but ran into some problems during training.
First, I noticed that in the provided EN-VI example dataset, the English sentences look like this:
"That report was written by 620 scientists from 40 countries ."
But in the data from training-parallel-nc-v13 or wmt17-metrics-task, the English sentences look like this:
"And now, finally, an Olympic champion."
Here the '.' and ',' follow the preceding word directly, with no space in between. Does this affect performance? Do I need to preprocess the text first and separate the punctuation from the words?
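To make the question concrete, this is roughly the preprocessing I have in mind. It is only a sketch: the regular expression is my own stand-in for a real tokenizer, and I do not know which tokenizer was actually used for the EN-VI data. The function name `split_punctuation` is hypothetical.

```python
import re

# Assumption: a simple regex stands in for a real tokenizer; it only
# separates a fixed set of punctuation marks from the adjacent words.
_PUNCT = re.compile(r"([.,!?;:\"()])")


def split_punctuation(sentence: str) -> str:
    """Put spaces around punctuation and collapse repeated whitespace."""
    spaced = _PUNCT.sub(r" \1 ", sentence)
    return " ".join(spaced.split())


print(split_punctuation("And now, finally, an Olympic champion."))
# And now , finally , an Olympic champion .
```

A sentence that is already tokenized, like the EN-VI example, passes through unchanged. One known limitation of this sketch is that it also splits decimals such as "3.5".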
I built the vocab file from the words sorted by count, with additions at the beginning. The result is:
I get 16000+ words in newstest2017-enzh-ref.zh. Is that right? Do I need to reduce the vocab size?
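For reference, a minimal sketch of how such a vocab file could be built. The special tokens `<unk>`, `<s>`, `</s>` and their order are an assumption on my side, and `build_vocab` and `max_size` are hypothetical names, not part of the training code.

```python
from collections import Counter
from typing import Iterable, List

# Assumption: these three special tokens and their order are illustrative;
# use whatever the training code actually expects.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def build_vocab(lines: Iterable[str], max_size: int = 0) -> List[str]:
    """Return the special tokens followed by words sorted by descending count.

    Lines are assumed to be tokenized already (tokens separated by spaces).
    A max_size greater than 0 keeps only that many most frequent words.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    # Sort by count (descending), then alphabetically so ties are stable.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    if max_size > 0:
        words = words[:max_size]
    return SPECIAL_TOKENS + words


corpus = ["a b a c", "b a"]
print(build_vocab(corpus))
# ['<unk>', '<s>', '</s>', 'a', 'b', 'c']
```

Passing `max_size` is one way to reduce the vocab size; words that fall outside the list would then have to be mapped to the unknown token.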
Second, how should I process the Chinese text, and how do I build the Chinese vocab?
Issue #22 discusses something related,
but it gives no details. How can I process the Chinese text? With NLTK?
By the way, does it affect the result if the dev file and the test file are the same?
I segmented the Chinese text with jieba:
https://github.com/fxsjy/jieba
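Roughly what the segmentation step looks like, as a sketch rather than my exact script. It uses `jieba.lcut`, which returns the segments as a list; the per-character fallback when jieba is not installed is an assumption added for illustration, and `segment_zh` is a hypothetical helper name.

```python
from typing import List

try:
    import jieba  # third-party package: pip install jieba
except ImportError:
    # Assumption: without jieba, fall back to one token per character.
    jieba = None


def segment_zh(sentence: str) -> List[str]:
    """Segment a Chinese sentence into a list of non-empty tokens."""
    sentence = sentence.strip()
    if jieba is not None:
        # jieba.lcut returns the segmentation as a list of strings.
        return [tok for tok in jieba.lcut(sentence) if tok.strip()]
    return [ch for ch in sentence if not ch.isspace()]


tokens = segment_zh("我来到北京清华大学")
# Write tokens separated by spaces so the vocab can be built the same
# way as for the English side.
print(" ".join(tokens))
```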
I trained the model with wmt16.json, but the results are not very good.
I am going to use a larger training set and try other hparams.
Hi, I want to train this model on a ZH-EN dataset from http://www.statmt.org/wmt17/metrics-task.html or http://statmt.org/wmt18/translation-task.html
Thank you very much!