gregtatum opened 6 months ago
I read up on #38, which states that part of the design is to augment based on whitespace splitting. I'm unsure what would be the best way to preserve the original alignment information.
Perhaps each alignment could be mapped along the way, or perhaps it's enough to assume whitespace tokenization, count the tokens of the original sentences, and apply the resulting offset to the alignments.
The merge-sentences modifier uses whitespace tokenization:
https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L12-L17
It then counts the tokens to compute the offsets for the alignments:
https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L28-L31
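For concreteness, here is a minimal sketch (not OpusTrainer's actual code) of that approach: when sentence pairs are concatenated, each pair's alignment indices are shifted by the running whitespace token counts of the merged prefix. Alignments are assumed to be `(src_index, trg_index)` tuples, and `merge_with_alignments` is a hypothetical name for illustration.

```python
def merge_with_alignments(pairs):
    """Merge (src, trg, alignment) triples into a single triple,
    offsetting each alignment pair by the whitespace token counts
    of the sentences merged before it."""
    src_parts, trg_parts, merged_alignment = [], [], []
    src_offset = trg_offset = 0
    for src, trg, alignment in pairs:
        src_parts.append(src)
        trg_parts.append(trg)
        for s, t in alignment:
            merged_alignment.append((s + src_offset, t + trg_offset))
        # str.split() is the whitespace tokenization assumed here.
        src_offset += len(src.split())
        trg_offset += len(trg.split())
    return " ".join(src_parts), " ".join(trg_parts), merged_alignment


merged = merge_with_alignments([
    ("a b", "x y", [(0, 0), (1, 1)]),
    ("c", "z", [(0, 0)]),
])
# → ("a b c", "x y z", [(0, 0), (1, 1), (2, 2)])
```

This works as long as the alignment indices really do refer to whitespace tokens.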
However, this is not correct for non-whitespace-segmented languages, nor when training with subword tokenization. The fix here would be to provide a tokenizer configuration that produces the correct tokenization, such as a SentencePiece tokenizer.
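To illustrate the mismatch (a hypothetical example, using a toy stand-in for a real subword model so it runs without a SentencePiece model file): a non-whitespace-segmented sentence collapses to a single token under `str.split()`, and a subword tokenizer yields a different token count than whitespace words, so whitespace-based offsets would point at the wrong positions.

```python
def whitespace_tokenize(text):
    # The tokenization the merge modifier currently assumes.
    return text.split()

def toy_subword_tokenize(text):
    # Stand-in for a trained SentencePiece model: split each word
    # into 2-character pieces, just to show the counts diverge.
    return [word[i:i + 2]
            for word in text.split()
            for i in range(0, len(word), 2)]

japanese = "これはテストです"            # no spaces to split on
english = "tokenization example"

len(whitespace_tokenize(japanese))      # 1  — every alignment index collapses to 0
len(whitespace_tokenize(english))       # 2
len(toy_subword_tokenize(english))      # 10 — offsets computed from whitespace
                                        # counts no longer line up
```

With a configurable tokenizer, the offset computation above would count tokens from the configured tokenizer rather than from `str.split()`.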