hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License
486 stars, 59 forks

Add tokenization for Tetun Dili (tdt) #144

Open · BLKSerene opened 11 months ago

BLKSerene commented 11 months ago

This PR is based on and supersedes #114, which seems to have been stale for more than 2 years. It also updates the nonbreaking prefixes for Tetun Dili.

jelmervdl commented 11 months ago

Hi, thanks for this addition!

Do you have some example sentences that trigger the added regular expressions and (ideally) some of the non-breaking prefixes unique to this language? In the future, I'd like to add tests for all supported languages so we can make sure we don't break/change anything by accident.
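
To make the request concrete, here is a rough sketch of what a nonbreaking-prefix regression test could check. It is a toy, standard-library-only illustration and not the sacremoses or Moses implementation: the `tokenize` helper, the prefix set, and the sample sentence are all hypothetical placeholders, and the real tokenizer applies more rules than this (for example, around numbers and lowercase follow-up words).

```python
import re

# Hypothetical prefix set for illustration only; NOT the real tdt list.
EXAMPLE_PREFIXES = {"Dr"}


def tokenize(text, prefixes=EXAMPLE_PREFIXES):
    """Toy whitespace tokenizer: split a trailing period off each word,
    unless the word is a known nonbreaking prefix and not sentence-final."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        match = re.fullmatch(r"(\S+)\.", word)
        if match is None:
            # No trailing period: keep the word unchanged.
            tokens.append(word)
            continue
        stem = match.group(1)
        is_last = i == len(words) - 1
        if stem in prefixes and not is_last:
            # Known prefix in mid-sentence: keep the period attached.
            tokens.append(word)
        else:
            # Ordinary word or final word: split the period off.
            tokens.extend([stem, "."])
    return tokens


# Placeholder sentence; a real test would use a Tetun Dili sentence.
sentence = "Dr. Silva arrived today."
with_prefixes = tokenize(sentence)
without_prefixes = tokenize(sentence, prefixes=set())
```

A sentence is useful as a test case when the two results differ, because that shows the language-specific prefix list is what changes the output.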

BLKSerene commented 11 months ago

I don't speak Tetun Dili, so I hope these tests work as expected...

jelmervdl commented 11 months ago

I noticed there's a test sentence in the original mosesdecoder pull request, but when I try it, the Perl and Python implementations yield different output. The implementation in the original pull request (which is what's currently in the Moses tokenizer) also differs from the one in #114.

I'll dig a bit deeper to see whether I can find out why #114 decided to implement it differently. I'm tempted to stick with what's in the old Moses repo unless there's a very good reason not to.

BLKSerene commented 4 months ago

Hi, any updates on this? Should I close this PR? Alternatively, I can modify it to update only the nonbreaking prefixes.