Open gregtatum opened 8 months ago
It looks like parse_alignments has some validation:
    if src_tokens is not None and trg_tokens is not None:
        for pair in pairs:
            if pair.src < 0 or pair.src >= len(src_tokens) \
            or pair.trg < 0 or pair.trg >= len(trg_tokens):
                raise ValueError('Out-of-bound alignment pairs')
But I think the alignments should be validated again after being modified. I would also prefer to log anything that fails validation. It's hard to debug when you are dealing with tens of millions of sentence pairs.
In Marian, invalid alignments lead to a crash, as the index bounds for tokens are not checked. This breaks training. Plus, if alignments are generated incorrectly on the OpusTrainer side, this will degrade the final performance when using guided alignment training. It should be cheap and easy to validate that the alignments are within bounds. If they are out of bounds, then the sentence pair can be discarded with a warning.
This could be done like the following:
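A minimal sketch of such a check, assuming the alignment pairs expose integer `src` and `trg` indices as in the validation quoted above. `Pair`, `alignments_in_bounds`, and `keep_valid` are hypothetical names used for illustration, not existing OpusTrainer APIs:

```python
import logging
from typing import Iterable, Iterator, NamedTuple, Sequence, Tuple

logger = logging.getLogger("alignment_validation")  # hypothetical logger name


class Pair(NamedTuple):
    # Hypothetical stand-in for the alignment pair type; only .src/.trg are assumed.
    src: int
    trg: int


def alignments_in_bounds(pairs: Iterable[Pair],
                         src_tokens: Sequence[str],
                         trg_tokens: Sequence[str]) -> bool:
    """Return True if every pair indexes a real source and target token."""
    for pair in pairs:
        if not (0 <= pair.src < len(src_tokens)
                and 0 <= pair.trg < len(trg_tokens)):
            # Log instead of raising, so one bad pair does not stop the run.
            logger.warning(
                "Out-of-bound alignment %d-%d (src len %d, trg len %d)",
                pair.src, pair.trg, len(src_tokens), len(trg_tokens))
            return False
    return True


Example = Tuple[Sequence[str], Sequence[str], Sequence[Pair]]


def keep_valid(examples: Iterable[Example]) -> Iterator[Example]:
    """Yield only the sentence pairs whose alignments are within bounds."""
    for src_tokens, trg_tokens, pairs in examples:
        if alignments_in_bounds(pairs, src_tokens, trg_tokens):
            yield src_tokens, trg_tokens, pairs
```

The idea would be to run this on the final tokens and alignments, after any modifier has touched them, so that the sentence pair is dropped with a warning rather than passed on to Marian.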
One drawback is that it still won't catch issues where the wrong tokenization strategy is used, as in #53.