gregtatum opened 6 months ago
I read up on #38, which states that part of the design is to augment based on whitespace splitting. I'm unsure what would be the best way to preserve the original alignment information.
Perhaps each alignment could be mapped along the way, or perhaps it's enough to assume whitespace tokenization, count the tokens of the original sentences, and apply the resulting offset to the alignments.
The merge-sentences modifier uses whitespace tokenization:
https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L12-L17
It then counts the tokens to compute the offsets for the alignments:
https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L28-L31
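For concreteness, here is a minimal sketch (not OpusTrainer's actual code) of that approach: when sentence pairs are concatenated, each pair's alignment indices are shifted by the running whitespace token counts of the merged prefix. Alignments are assumed to be `(src_index, trg_index)` tuples, and `merge_with_alignments` is a hypothetical name for illustration.

```python
def merge_with_alignments(pairs):
    """Merge (src, trg, alignment) triples into a single triple,
    offsetting each alignment pair by the whitespace token counts
    of the sentences merged before it."""
    src_parts, trg_parts, merged_alignment = [], [], []
    src_offset = trg_offset = 0
    for src, trg, alignment in pairs:
        src_parts.append(src)
        trg_parts.append(trg)
        for s, t in alignment:
            merged_alignment.append((s + src_offset, t + trg_offset))
        # str.split() is the whitespace tokenization assumed here.
        src_offset += len(src.split())
        trg_offset += len(trg.split())
    return " ".join(src_parts), " ".join(trg_parts), merged_alignment


merged = merge_with_alignments([
    ("a b", "x y", [(0, 0), (1, 1)]),
    ("c", "z", [(0, 0)]),
])
# → ("a b c", "x y z", [(0, 0), (1, 1), (2, 2)])
```

This works as long as the alignment indices really do refer to whitespace tokens.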
However, this is not correct for non-whitespace-segmented languages, nor when training with subword tokenization. The fix here would be to provide a tokenizer configuration that produces the correct tokenization, such as a SentencePiece tokenizer.
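To illustrate the mismatch (a hypothetical example, using a toy stand-in for a real subword model so it runs without a SentencePiece model file): a non-whitespace-segmented sentence collapses to a single token under `str.split()`, and a subword tokenizer yields a different token count than whitespace words, so whitespace-based offsets would point at the wrong positions.

```python
def whitespace_tokenize(text):
    # The tokenization the merge modifier currently assumes.
    return text.split()

def toy_subword_tokenize(text):
    # Stand-in for a trained SentencePiece model: split each word
    # into 2-character pieces, just to show the counts diverge.
    return [word[i:i + 2]
            for word in text.split()
            for i in range(0, len(word), 2)]

japanese = "これはテストです"            # no spaces to split on
english = "tokenization example"

len(whitespace_tokenize(japanese))      # 1  — every alignment index collapses to 0
len(whitespace_tokenize(english))       # 2
len(toy_subword_tokenize(english))      # 10 — offsets computed from whitespace
                                        # counts no longer line up
```

With a configurable tokenizer, the offset computation above would count tokens from the configured tokenizer rather than from `str.split()`.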