hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

detokenization does not add a space between Chinese/Japanese characters and non-CJK characters #71

Closed: brandonherzog closed this issue 4 years ago

brandonherzog commented 4 years ago

The original Moses Perl detokenizer adds a space between a token that does not end with a CJK character and a following token that starts with one: https://github.com/moses-smt/mosesdecoder/blob/555829a771cd897bb807f495a95737953a7ca9a3/scripts/tokenizer/detokenizer.perl#L109-L115

The current Python port, by contrast, adds a space only when the token itself starts with a CJK character and does not end with one: https://github.com/alvations/sacremoses/blob/4d994b8781f6c10600d34413679e1a1acdb53cb5/sacremoses/tokenize.py#L692-L696
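For illustration, here is a minimal sketch of the two checks as described above. is_cjk is a hypothetical stand-in for the Unicode-range tests both implementations use (not sacremoses's actual internals), and the rest of the detokenizer (default spacing, punctuation handling) is elided:

import unicodedata

def is_cjk(char):
    # Hypothetical helper: rough CJK test via Unicode character names;
    # the real scripts match explicit Han/Hiragana/Katakana ranges.
    return unicodedata.name(char, "").startswith(("CJK", "HIRAGANA", "KATAKANA"))

def needs_space_port(text_so_far, token):
    # Python port (buggy): looks only at the current token, so a
    # single-character token like '日' both starts and ends with a CJK
    # character and never triggers the space.
    return is_cjk(token[0]) and not is_cjk(token[-1])

def needs_space_perl(text_so_far, token):
    # Moses Perl rule: compares the end of the text built so far with
    # the start of the next token.
    return is_cjk(token[0]) and not is_cjk(text_so_far[-1])

With these, needs_space_port('Japan is', '日') is False while needs_space_perl('Japan is', '日') is True, which matches the reproduction below.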

This seems like a mistake; I would expect the original behavior to be replicated:

from sacremoses import MosesDetokenizer

detokenizer = MosesDetokenizer()
text = detokenizer.detokenize(['Japan', 'is', '日', '本', 'in', 'Japanese', '.'])
# Expected (Moses Perl behavior):
assert text == 'Japan is 日本 in Japanese.'
# Actual: currently returns 'Japan is日本 in Japanese.', with no space before 日
alvations commented 4 years ago

Patched in #72.