tokenizer.perl: split final dots unconditionally

moses-smt / mosesdecoder

Moses, the machine translation system

http://www.statmt.org/moses

GNU Lesser General Public License v2.1

1.58k stars 778 forks source link

tokenizer.perl: split final dots unconditionally #204

Closed ozancaglayan closed 6 years ago

ozancaglayan commented 6 years ago

Allow tokenization of non-breaking prefixes at end of sentences. This should be a fair compromise in many cases to construct a cleaner vocabulary.

EN-old: So am I. EN-new: So am I .

DE-old: ... schwer wie ein iPhone 5. DE-new: ... schwer wie ein iPhone 5 .

FR-old: Des gens admirent une œuvre d' art. FR-new: Des gens admirent une œuvre d' art .

CS-old: Dvě děti, které běží bez bot. CS-new: Dvě děti, které běží bez bot .

emjotde commented 6 years ago

Shouldn't things like this be optional? The Moses tokenizer is pretty much a standard tool by now. You are introducing backwards-incompatibility. BPE or other subword methods are handling cases like this just fine, so a performance improvement is unlikely at the cost of predictability.

ozancaglayan commented 6 years ago

The correct working of tokenizer should not be relying on post-fixes that may be done by tools like subword-nmt or sentencepiece, etc right? I noticed this when I was working completely on word-level. But you may be right, this is a behavior change. But in that case, Moses tokenizer should be frozen and no longer be modified. So let's hear from other people as well.

Thanks for the comment :)

emjotde commented 6 years ago

I am actually for freezing it. Regardless of correctness.

emjotde commented 6 years ago

Also, isn't there a whole bunch of regression tests that would now blow up? This affects the entire downstream path.

hieuhoang commented 6 years ago

I would side with Ozan on this. If people want backward compatibility, they can add a flag --old-behaviour to the tokenizer script, or use a previous version of Moses. This change looks like it would result in better translation so I am minded to make it the default

I don't think there is a test for the tokenizer. But in any case, regression tests are there to ensure that we don't accidently change something, not to prevent us from making a positive change. We will change the tests if needed

Any other comments appreciated

emjotde commented 6 years ago

To add ' --old-behaviour' to your work-flow you would still need to figure out where the sudden change is coming from. This would happen without warning.

alvations commented 6 years ago

Btw not only final fullstop has this problem, apostrophe too. https://github.com/alvations/sacremoses/issues/3

hieuhoang commented 6 years ago

pulling as it looks like it will improve translation. Backward compatbility is a secondary priority imo

hieuhoang commented 6 years ago

@alvations did you fix it in sacremoses?

alvations commented 6 years ago

It's night time here 😉 But I'll work on an equivalent patch with some backwards compatibility in sacremoses when I'm awake.

Could we give some versioning to the Moses tokenizer? Then it'll be easy for ports and people to refer to which version they are using

hieuhoang commented 6 years ago

@alvations Is this a version of the sacremoses tokenizer? https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/python-tokenizer/moses.py It might be better to delete it from moses and tell people about sacremoses instead.

I was also on the market for a python moses tokenizer and added this to moses https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer/mosestokenizer based on Luis Gomes' code https://pypi.org/project/mosestokenizer/

alvations commented 5 years ago

Hmmm, tricky, I was looking into too and didn't spot much difference until @patdue raised the issue if the last token is one of the non-splitting string. https://github.com/alvations/sacremoses/issues/21