Open PolarLion opened 6 years ago
Hi,
Thanks for pointing that out.
As the corpus is automatically extracted from bilingual subtitles, there is some noise in the training data. You could simply filter out these sentence pairs on both sides. Given the 2M sentence pairs in the training data, these 4K sentences should not affect the model much.
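A minimal sketch of such a filter, in case it helps. It assumes the two sides are line-aligned, that a "Chinese" line is one containing any character in the basic CJK Unified Ideographs range, and that the Chinese side lives in a file such as train.zh (a hypothetical name; only train.en is mentioned in this thread). The function names are illustrative, not part of any released tooling.

```python
import re

# Assumption: a line counts as "Chinese" if it contains any character from
# the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK = re.compile(r"[\u4e00-\u9fff]")


def filter_pairs(en_lines, zh_lines):
    """Drop every pair whose English side contains Chinese characters.

    Both lists must be line-aligned; a pair is removed from both sides
    so the corpus stays parallel.
    """
    kept_en, kept_zh = [], []
    for en, zh in zip(en_lines, zh_lines):
        if CJK.search(en):
            continue  # noisy pair: skip it on both sides
        kept_en.append(en)
        kept_zh.append(zh)
    return kept_en, kept_zh


def filter_files(en_path, zh_path, out_en_path, out_zh_path):
    """Hypothetical helper: filter two aligned files and write clean copies."""
    with open(en_path, encoding="utf-8") as f_en, \
         open(zh_path, encoding="utf-8") as f_zh:
        kept_en, kept_zh = filter_pairs(f_en.readlines(), f_zh.readlines())
    with open(out_en_path, "w", encoding="utf-8") as f_en, \
         open(out_zh_path, "w", encoding="utf-8") as f_zh:
        f_en.writelines(kept_en)
        f_zh.writelines(kept_zh)
    return len(kept_en)
```

Calling something like `filter_files("train.en", "train.zh", "train.clean.en", "train.clean.zh")` would then write the cleaned pair of files and return the number of pairs kept. The check is deliberately strict: a single Chinese character on the English side drops the whole pair.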
We will also keep cleaning the data and release it in the next version.
Cheers, Longyue
Hi, I found some Chinese sentences (about 4,000) in the train.en file, for example
I'm not sure whether these errors also affect the other parallel data.
Thanks