I got an error when training the DocRepair model on the whole training dataset you provided

nomadlx commented 4 years ago

Code version: 38142bd

The md5sums of the DocRepair training dataset:

train.src: 437bec0e3a326bcc1b8bcdff09adb29b
train.dst: 2ff6ca89f41e857b834b6f1cf15798d9
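
For anyone who wants to compare their own copy against the checksums above, here is a minimal sketch using Python's standard `hashlib`. The helper name `md5_of_file` and the chunk size are illustrative choices, not part of the repository.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("train.src")` on the downloaded file should give the same string as `md5sum train.src` on the command line.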

The error I get when training the DocRepair model: train_docrepair_error.log

I also tried training on two other datasets, and both runs worked fine. The two datasets were generated as follows:

1. A subset (the first 100k lines) of the whole training dataset you provided.
2. The dataset from item 1 copied 240 times to form a large dataset (24M lines).
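
The two test datasets described here could be built roughly as follows. This is only a sketch: the function names and file paths are hypothetical, and it assumes the files are UTF-8 text with one example per line.

```python
from itertools import islice


def write_head(src_path, dst_path, n_lines):
    """Copy the first n_lines lines of src_path into dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(islice(src, n_lines))


def write_repeated(src_path, dst_path, times):
    """Write the contents of src_path into dst_path `times` times in a row."""
    with open(src_path, encoding="utf-8") as src:
        lines = src.readlines()
    with open(dst_path, "w", encoding="utf-8") as dst:
        for _ in range(times):
            dst.writelines(lines)
```

For example, `write_head("train.src", "train.100k.src", 100_000)` followed by `write_repeated("train.100k.src", "train.24m.src", 240)` would reproduce the 100k-line and 24M-line source files. The same two calls would be needed for `train.dst` so that source and target stay aligned.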

However, training also failed on a subset (the first 3M lines) of the whole training dataset you provided.

So I think the error is not caused by the code or the runtime environment.

nomadlx commented 4 years ago

I have found the error in train.src line 271101:

онфукиру , нрберт лем . _eosским ? _eos что ты делаешь ? _eos не могу дождаться . _eos_eos онфукиру , нрберт лем .:7 _eos онфукиру , нрберт лем .:5 _eos онфукиру , нрберт лем .:2 _eos онфукиру , нрберт лем .:2 _eos онфукиру , н формы лем .:1 _eos онфукиру , нэрберт лем .:1 _eos онфукуро , нрберт лем .:1 _eos онфукуру , нрберту лем .:1 _eos_eosским ?:20 _eos_eos что ты делаешь ?:20 _eos_eos не могу дождаться .:14 _eos жду не дождусь .:5 _eos я не могу дождаться .:1

The "_eos_eos" is joined to "ским ?:20" without a space. This leads to an IndexError in the function maxlen_monolingual_repair in the file ./lib/task/seq2seq/data.py.

It also seems that this is not the only instance in train.src that makes the program fail.
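
To find the other affected lines, something like the following sketch might help. It assumes that a well-formed `_eos` marker is always followed by whitespace, by the end of the line, or by another `_eos`. That assumption is inferred from the example line above rather than taken from the check inside `maxlen_monolingual_repair`, and the names `GLUED_EOS` and `find_glued_eos` are made up for illustration.

```python
import re

# An "_eos" that is NOT followed by another "_eos", whitespace, or end of line,
# i.e. a marker glued directly onto the next token (assumed to be malformed).
GLUED_EOS = re.compile(r"_eos(?!_eos|\s|$)")


def find_glued_eos(lines):
    """Yield (line_number, line) for each line containing a glued _eos marker.

    Line numbers start at 1, matching how editors and `sed -n` count lines.
    """
    for number, line in enumerate(lines, start=1):
        if GLUED_EOS.search(line.rstrip("\n")):
            yield number, line
```

Passing an open file handle works directly, for example `with open("train.src", encoding="utf-8") as f: bad = [n for n, _ in find_glued_eos(f)]`. Lines reported this way are only candidates and would still need to be checked by hand.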

lena-voita commented 4 years ago

Thank you for pointing this out! (The problem appeared when removing bpe segmentation from the data my models were trained on). I've updated the data, now there shouldn't be any problem.

Side note: In the message above, you showed the line from the data we released. I do not see subword segmentation (e.g., BPE) used. Are you trying to train without any subword segmentation? I suggest you segment the data; otherwise your translation system won't perform well.

nomadlx commented 4 years ago

Thanks for the reminder about subword segmentation. I will use subword-segmented data to train the model in my formal experiments.

lena-voita / good-translation-wrong-in-context
