Update sacremoses.util.CJKChars and is_cjk

hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer

MIT License

486 stars, 59 forks

Closed BLKSerene closed 5 years ago

BLKSerene commented 5 years ago

Currently sacremoses.util.is_cjk treats Japanese kana as CJK characters, which I think should be excluded.

Maybe it would be better to use https://en.wikipedia.org/wiki/Unicode_block as the reference instead of https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane (given in the docstring) and enumerate the Unicode code points of all characters under the "Hangul" and "Han" scripts.
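A minimal sketch of this block-based approach, assuming only a handful of well-known Han and Hangul blocks. The list is an illustrative subset rather than a complete enumeration, it is not what sacremoses ships with, and the names `HAN_AND_HANGUL_BLOCKS` and `is_han_or_hangul` are hypothetical:

```python
# Illustrative subset of Unicode blocks (inclusive code point ranges).
# Assumption: this is NOT the list used by sacremoses, and it is not exhaustive.
HAN_AND_HANGUL_BLOCKS = [
    (0x1100, 0x11FF),    # Hangul Jamo
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul Syllables
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def is_han_or_hangul(char):
    """Hypothetical helper: True if `char` lies in one of the blocks above."""
    code_point = ord(char)
    return any(start <= code_point <= end for start, end in HAN_AND_HANGUL_BLOCKS)
```

With this subset, `is_han_or_hangul("中")` and `is_han_or_hangul("한")` are true, while hiragana and katakana such as `"あ"` and `"カ"` fall outside every listed range and are rejected.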

I'm also not sure whether Tibetan characters and scripts like Nushu (女书) should be treated as CJK characters (I'm not an expert in Unicode).

alvations commented 5 years ago

@BLKSerene by kana, do you mean katakana? If so, it's surely CJK.

Also, the current implementation follows charIsCJK.

If you see anything wrong with is_cjk(), feel free to patch it with a PR.

BLKSerene commented 5 years ago

@alvations Hi, I've created a pull request to update is_cjk. Please take a look.

alvations commented 5 years ago

Closed via #14