hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

"p.m." is not tokenized as in the original script. #21

Open pypae opened 5 years ago

pypae commented 5 years ago

I have not yet figured out why, but in the original script the final dot of "p.m." at the end of a sentence is not split off, whereas in this port it is.

The original script even explicitly leaves p.m out of its nonbreaking prefixes, so I'd expect it to behave the way the port does.
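A minimal way to check the port's side of this might look like the sketch below. The example sentence and the helper names `tokenize`, `final_period_is_split`, and `report` are made up for illustration; only `MosesTokenizer(lang=...)` and `tokenize(..., return_str=True)` come from sacremoses. No output is shown, since it depends on the installed version.

```python
def final_period_is_split(tokenized: str) -> bool:
    """Return True if the last whitespace-separated token is a lone '.'."""
    tokens = tokenized.split()
    return bool(tokens) and tokens[-1] == "."


def tokenize(text: str, lang: str = "en") -> str:
    """Tokenize with sacremoses and return a space-joined string."""
    # Imported lazily so the helper above works without sacremoses installed.
    from sacremoses import MosesTokenizer

    return MosesTokenizer(lang=lang).tokenize(text, return_str=True)


def report(sentence: str) -> None:
    """Print the tokenization and whether the final period was split off."""
    tokenized = tokenize(sentence)
    print(tokenized)
    print("final period split:", final_period_is_split(tokenized))
```

Calling `report("The meeting starts at 5 p.m.")` and comparing the printed line with the output of the original `tokenizer.perl` on the same sentence should show whether the two still disagree.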

alvations commented 5 years ago

The original script added a new hack for this quite recently: https://github.com/moses-smt/mosesdecoder/pull/204

This difference isn't accounted for in sacremoses, and I'm really not sure whether it should be.

ZJaume commented 4 years ago

Why shouldn't sacremoses include this?