Closed ozancaglayan closed 6 years ago
Shouldn't things like this be optional? The Moses tokenizer is pretty much a standard tool by now. You are introducing backwards-incompatibility. BPE or other subword methods are handling cases like this just fine, so a performance improvement is unlikely at the cost of predictability.
A tokenizer's correctness should not rely on after-the-fact fixes from tools like subword-nmt or sentencepiece, right? I noticed this while working entirely at the word level. You may be right that this is a behavior change. But in that case, the Moses tokenizer should be frozen and no longer modified. So let's hear from other people as well.
Thanks for the comment :)
I am actually for freezing it. Regardless of correctness.
Also, isn't there a whole bunch of regression tests that would now blow up? This affects the entire downstream path.
I would side with Ozan on this. If people want backward compatibility, they can add a flag --old-behaviour to the tokenizer script, or use a previous version of Moses. This change looks like it would result in better translation, so I am minded to make it the default.
I don't think there is a test for the tokenizer. But in any case, regression tests are there to ensure that we don't accidentally change something, not to prevent us from making a positive change. We will change the tests if needed.
Any other comments appreciated
To add '--old-behaviour' to your workflow, you would still need to figure out where the sudden change is coming from. This would happen without warning.
Btw, it's not only the final full stop that has this problem; the apostrophe does too. https://github.com/alvations/sacremoses/issues/3
Pulling, as it looks like it will improve translation. Backward compatibility is a secondary priority imo.
@alvations did you fix it in sacremoses?
It's night time here 😉 But I'll work on an equivalent patch with some backwards compatibility in sacremoses when I'm awake.
Could we give some versioning to the Moses tokenizer? Then it'll be easy for ports and people to refer to which version they are using
@alvations Is this a version of the sacremoses tokenizer? https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/python-tokenizer/moses.py It might be better to delete it from moses and tell people about sacremoses instead.
I was also in the market for a Python Moses tokenizer and added this to Moses: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer/mosestokenizer, based on Luis Gomes' code: https://pypi.org/project/mosestokenizer/
Hmmm, tricky. I was looking into it too and didn't spot much difference until @patdue raised the issue of the last token being one of the non-splitting strings. https://github.com/alvations/sacremoses/issues/21
Allow tokenization of non-breaking prefixes at the end of sentences. This should be a fair compromise in many cases, since it yields a cleaner vocabulary.
EN-old: So am I. EN-new: So am I .
DE-old: ... schwer wie ein iPhone 5. DE-new: ... schwer wie ein iPhone 5 .
FR-old: Des gens admirent une œuvre d' art. FR-new: Des gens admirent une œuvre d' art .
CS-old: Dvě děti, které běží bez bot. CS-new: Dvě děti, které běží bez bot .
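The old/new pairs above can be reproduced with a toy sketch. This is not the Moses or sacremoses implementation: the function name `split_final_period`, the tiny `NONBREAKING_PREFIXES` set, and the `old_behaviour` parameter (loosely mirroring the `--old-behaviour` flag suggested in the thread) are all assumptions made for illustration. It handles only a single trailing period on whitespace-separated tokens and ignores the other Moses rules.

```python
# Toy prefix set for illustration only; the real lists are per-language
# prefix files shipped with Moses, and they are much longer.
NONBREAKING_PREFIXES = {"I", "Mr", "Dr", "art", "5", "bot"}


def split_final_period(sentence, old_behaviour=False):
    """Split a trailing period off tokens, protecting non-breaking prefixes.

    With old_behaviour=True a protected prefix keeps its period everywhere.
    With old_behaviour=False (the change discussed here) the period is
    still split off when the protected prefix is the last token.
    """
    tokens = sentence.split()
    out = []
    for i, tok in enumerate(tokens):
        is_last = i == len(tokens) - 1
        # Only consider tokens ending in exactly one period (skip "." and "...").
        if len(tok) > 1 and tok.endswith(".") and not tok.endswith(".."):
            stem = tok[:-1]
            protected = stem in NONBREAKING_PREFIXES
            if protected and (old_behaviour or not is_last):
                out.append(tok)           # keep "Mr." / "I." intact
            else:
                out.extend([stem, "."])   # emit the period as its own token
        else:
            out.append(tok)
    return " ".join(out)


print(split_final_period("So am I.", old_behaviour=True))
print(split_final_period("So am I."))
print(split_final_period("... schwer wie ein iPhone 5."))
```

Keeping the old path behind an explicit switch is one way to address the backward-compatibility concern raised earlier, although, as noted in the thread, users would still need to discover that the default had changed.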