hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License
486 stars 59 forks source link

Truecaser Known Case Tokens #16

Closed pypae closed 5 years ago

pypae commented 5 years ago

If a word is not the first word of the sentence, and the word was seen with this exact casing in the training material, the original script does not recase the word.

i.e.

perl train - truecaser.perl --model big.model --corpus big.txt
echo "THE ADVENTURES OF SHERLOCK HOLMES" | perl truecase.perl --model big.model
the ADVENTURES OF SHERLOCK HOLMES
alvations commented 5 years ago

This I'm not sure why would that be expected, yes if we were to emulate that, it should keep the caps cases but it's strange that truecasing didn't catch that in Moses.

alvations commented 5 years ago

This should be patched by #39

Thanks @Patdue for spotting the issue!