hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Fix detruecaser when the first token is all-caps #108

Closed yuyang-huang closed 3 years ago

yuyang-huang commented 4 years ago

There's a small mismatch between MosesDetruecaser and the original Perl script:

$ echo 'COVID @-@ 19' | perl detruecase.perl
COVID @-@ 19
$ echo 'COVID @-@ 19' | sacremoses detruecase
Covid @-@ 19

This happens because str.capitalize() uppercases the first character and lowercases the rest.

This PR changes token.capitalize() to token[:1].upper() + token[1:] and adds a unit test for it.
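To illustrate the difference, here is a minimal standalone sketch. The helper name `capitalize_first_only` is hypothetical and is not part of sacremoses; it only wraps the expression quoted above so the two behaviours can be compared side by side.

```python
def capitalize_first_only(token: str) -> str:
    """Uppercase the first character and leave the rest of the token unchanged.

    Hypothetical helper for illustration; it wraps the expression from the PR.
    """
    # Slicing with [:1] instead of indexing with [0] keeps this safe for
    # an empty string, which simply comes back unchanged.
    return token[:1].upper() + token[1:]


if __name__ == "__main__":
    for tok in ["COVID", "hello", "iPhone"]:
        # str.capitalize() lowercases everything after the first character,
        # which is what turns "COVID" into "Covid".
        print(f"{tok!r}: capitalize() -> {tok.capitalize()!r}, "
              f"first-only -> {capitalize_first_only(tok)!r}")
```

With the slicing form, an all-caps first token such as `COVID` passes through unchanged, which matches the Perl output shown above, while a lowercase token such as `hello` still becomes `Hello`.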

pavelnemirovsky commented 3 years ago

@yuyang-huang this is indeed a bug. @alvations, can you please merge it? It hurts...

alvations commented 3 years ago

Thank you @yuyang-huang !