Closed nomadlx closed 4 years ago
I have found the error in train.src line 271101:
онфукиру , нрберт лем . _eosским ? _eos что ты делаешь ? _eos не могу дождаться . _eos_eos онфукиру , нрберт лем .:7 _eos онфукиру , нрберт лем .:5 _eos онфукиру , нрберт лем .:2 _eos онфукиру , нрберт лем .:2 _eos онфукиру , н формы лем .:1 _eos онфукиру , нэрберт лем .:1 _eos онфукуро , нрберт лем .:1 _eos онфукуру , нрберту лем .:1 _eos_eosским ?:20 _eos_eos что ты делаешь ?:20 _eos_eos не могу дождаться .:14 _eos жду не дождусь .:5 _eos я не могу дождаться .:1
The " _eos_eos" is joined directly to "ским ?:20" with no space in between.
This leads to an IndexError
in the function maxlen_monolingual_repair,
which is in the ./lib/task/seq2seq/data.py file.
And it seems this is not the only line in train.src that will crash the program.
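In case it helps anyone checking their own copy of the data, here is a minimal sketch for listing the affected lines. It is not code from the repository: the function name `find_fused_eos` and the regex are my own, and it assumes that every `_eos` should be followed by either whitespace, the end of the line, or another `_eos`.

```python
import re

# Assumption: a well-formed "_eos" is followed by whitespace, end of line,
# or another "_eos". Anything else (e.g. "_eosским") is flagged.
FUSED_EOS = re.compile(r"_eos(?!_eos)(?=\S)")


def find_fused_eos(lines):
    """Return (1-based line number, character offsets) for each suspicious line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        # Offsets of every "_eos" that is glued to the text after it.
        offsets = [m.start() for m in FUSED_EOS.finditer(line)]
        if offsets:
            hits.append((lineno, offsets))
    return hits
```

To scan the file, something like `find_fused_eos(open("train.src", encoding="utf-8"))` should work, since the function only needs an iterable of strings.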
Thank you for pointing this out! (The problem appeared when removing BPE segmentation from the data my models were trained on.) I've updated the data, so there shouldn't be any problem now.
Side note: in the message above, you showed a line from the data we released. I do not see any subword segmentation (e.g., BPE) in it. Are you trying to train without subword segmentation? I suggest you segment the data; otherwise your translation system won't perform well.
Thanks for the reminder about subword segmentation. I will use subword-segmented data to train the model in my formal experiments.
Code version: 38142bd
The md5sum of the DocRepair training dataset:
The error I get when training the DocRepair model: train_docrepair_error.log
I also tried training on two other datasets, and it ran fine. The two datasets were generated as follows:
But training also failed on a subset (the first 3M lines) of the full training dataset you provided.
So I think the error is not caused by the code or the runtime environment.