hplt-project / OpusTrainer

Curriculum training
https://pypi.org/project/opustrainer/
MIT License

Merge sentences produces incorrect alignments when used with SentencePiece #53

Open gregtatum opened 6 months ago

gregtatum commented 6 months ago

The merge sentences modifier uses whitespace tokenization:

https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L12-L17

And then counts the tokens to perform offsetting for the alignments:

https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L28-L31

However, for languages that are not whitespace-segmented, and for training that uses subword tokenization, these token counts are not correct, so the offset alignments no longer point at the right tokens. A fix would be to provide a tokenizer configuration that produces the correct tokenization, such as a SentencePiece tokenizer.
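To make the problem concrete, here is a minimal sketch (not OpusTrainer's actual code; the function name and data are made up) of the whitespace-based merge-and-offset scheme, followed by a note on where it breaks once a subword tokenizer enters the picture:

```python
# Minimal sketch of whitespace-based alignment offsetting when merging
# two sentence pairs. Alignments are (src_index, trg_index) word pairs.

def merge_alignments_whitespace(pair_a, pair_b, align_a, align_b):
    """Merge two sentence pairs, offsetting the second pair's alignment
    indices by the whitespace token counts of the first pair."""
    src_a, trg_a = pair_a
    src_b, trg_b = pair_b
    src_offset = len(src_a.split())   # token counts via whitespace split
    trg_offset = len(trg_a.split())
    merged_src = src_a + " " + src_b
    merged_trg = trg_a + " " + trg_b
    merged_align = align_a + [(s + src_offset, t + trg_offset)
                              for s, t in align_b]
    return merged_src, merged_trg, merged_align

src, trg, align = merge_alignments_whitespace(
    ("New York", "Nueva York"), ("is big", "es grande"),
    [(0, 0), (1, 1)], [(0, 0), (1, 1)])
print(align)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

The offsets line up here because both sides really are one token per whitespace-delimited word. But if a SentencePiece model later splits, say, "Nueva" into "▁Nue" + "va", the trainer sees three target tokens for the first sentence, and index 2 in the merged alignment no longer refers to "es". For a non-whitespace-segmented language the `split()` counts are wrong from the start.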

gregtatum commented 6 months ago

I read up on #38, which states that part of the design is to augment based on whitespace splitting. I'm unsure what the best way to preserve the original alignment information would be.

Perhaps you could remap each alignment along the way, or maybe just re-tokenize, count the tokens, and assume that applying that offset to the original alignments would generate a correct result.
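The first option, remapping each alignment, could be sketched roughly like this. This is hypothetical code, not a proposal for OpusTrainer's API: it assumes you can find out which word each subword piece came from (SentencePiece exposes offsets that allow this; a toy splitter stands in for it here so the example is self-contained):

```python
# Hypothetical sketch: expand word-level alignments into subword-level
# alignments given a word -> subword-pieces mapping.

def word_to_subword_map(words, tokenize):
    """For each word index, list the subword indices it expands to."""
    mapping, next_idx = [], 0
    for word in words:
        pieces = tokenize(word)
        mapping.append(list(range(next_idx, next_idx + len(pieces))))
        next_idx += len(pieces)
    return mapping

def remap_alignment(align, src_map, trg_map):
    """Expand each word-level link (s, t) into all subword pairs it covers."""
    return [(si, ti)
            for s, t in align
            for si in src_map[s]
            for ti in trg_map[t]]

# Toy "subword" tokenizer: split every word into 2-character chunks.
toy = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]

src_map = word_to_subword_map("Nueva York".split(), toy)  # [[0, 1, 2], [3, 4]]
trg_map = word_to_subword_map("New York".split(), toy)    # [[0, 1], [2, 3]]
print(remap_alignment([(0, 0), (1, 1)], src_map, trg_map))
```

Once the alignments are expressed in subword indices, merging would offset by subword counts rather than whitespace counts, so the merged alignments stay valid for the tokens the trainer actually sees.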