longyuewangdcu / tvsub

TVsub: DCU-Tencent Chinese-English Dialogue Corpus

Chinese Sentences in train.en #1

Open PolarLion opened 6 years ago

PolarLion commented 6 years ago

Hi, I found some Chinese sentences (about 4,000) in the train.en file. For example:

[image]

I'm not sure if these bugs will affect other parallel data.

Thanks

longyuewangdcu commented 6 years ago

Hi,

Thanks for pointing that out.

As the corpus is automatically extracted from bilingual subtitles, there will be some noise in the training data. You can filter out such sentences directly, removing each affected pair from both sides so the files stay aligned. Given that the training data contains 2M sentence pairs, these 4K sentences should not affect the model much.
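A minimal sketch of that kind of filtering, not the corpus's actual cleaning script: it drops every pair whose English side contains a Chinese character. The names `filter_pairs` and `filter_files`, the file name `train.zh`, and the output names are hypothetical, and the check only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is an assumption about what the noise looks like.

```python
import re

# Assumption: any character in the basic CJK Unified Ideographs block
# marks a line as Chinese. Extension blocks and CJK punctuation are not checked.
CJK = re.compile(r"[\u4e00-\u9fff]")


def filter_pairs(zh_lines, en_lines):
    """Yield (zh, en) pairs whose English side contains no Chinese characters."""
    # zip() stops at the shorter input, so both files should have equal length.
    for zh, en in zip(zh_lines, en_lines):
        if CJK.search(en):
            continue  # drop the pair on both sides to keep the files aligned
        yield zh, en


def filter_files(zh_in, en_in, zh_out, en_out):
    """Stream two line-aligned files and write only the clean pairs."""
    with open(zh_in, encoding="utf-8") as fz, \
         open(en_in, encoding="utf-8") as fe, \
         open(zh_out, "w", encoding="utf-8") as oz, \
         open(en_out, "w", encoding="utf-8") as oe:
        for zh, en in filter_pairs(fz, fe):
            oz.write(zh)
            oe.write(en)
```

It could be called as `filter_files("train.zh", "train.en", "train.clean.zh", "train.clean.en")`. Reading line by line keeps memory use flat even with 2M pairs.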

We will also keep cleaning the data and will release it in the next version.

Cheers, Longyue