Open BLKSerene opened 11 months ago
Hi, thanks for this addition!
Do you have some example sentences that trigger the added regular expressions and (ideally) some of the non-breaking prefixes unique to this language? In the future, I'd like to add tests for all supported languages so we can make sure we don't break/change anything by accident.
I don't speak Tetun Dili, so I hope these tests work as expected...
I noticed there's a test sentence in the original mosesdecoder pull request, but when I try it, the Perl and Python implementations produce different output. The original pull request (and what's currently in the Moses tokenizer) also differs from the implementation here.
I'll dig a bit deeper to see whether I can find out why #114 implemented it differently. I'm tempted to stick to what's in the old Moses repo unless there's a very good reason not to.
Hi, any updates on this? Should I close this PR? Alternatively, I can modify it to only update the nonbreaking prefixes.
This PR copies from and replaces #114, which seems to have been stale for more than 2 years, and also updates the nonbreaking prefixes for Tetun Dili.