marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

sentence piece cannot get bleu-detok score when training. #235

Open zhonghao0077 opened 5 years ago

zhonghao0077 commented 5 years ago

I am trying to use SentencePiece (following the SentencePiece tutorial) as the preprocessing step for English-to-Japanese translation training. First I used SentencePiece alone, without tokenizing the Japanese with MeCab, and the bleu-detok score was always 0 during validation; training stopped within 1 epoch. Then I pre-tokenized the Japanese with MeCab, and bleu-detok stayed below 2 throughout training, which stopped within 3 epochs. I have checked the data: with MeCab and BPE, training keeps going and the validation procedure works correctly.

train.conf.txt train.log.txt

emjotde commented 5 years ago

Hm, the config looks alright. I cannot see anything obvious that could be wrong. Can you maybe share the data so I could try myself?

zhonghao0077 commented 5 years ago

Sorry for the late reply. Unfortunately, I cannot share the data with you. I've found an alternative dataset to track down the problem. You can download it from https://goo.gl/idaoxo; the data comes from https://nlp.stanford.edu/projects/jesc/, in the official_split category. I trained my model with this data, and bleu-detok still stayed at 0 during training.

alvations commented 5 years ago

Heads up on JESC, it's pretty noisy =)

zhonghao0077 commented 5 years ago

At least bleu-detok shouldn't be 0. It's just an experiment to help us locate the problem.