hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License
486 stars 59 forks

Update sacremoses.util.CJKChars and is_cjk #13

Closed: BLKSerene closed this 5 years ago

BLKSerene commented 5 years ago

Currently, sacremoses.util.is_cjk treats Japanese kana as CJK characters, which I think should be excluded.

It may be better to use https://en.wikipedia.org/wiki/Unicode_block as the reference instead of https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane (given in the docstring), and to enumerate the Unicode code points of all characters under the "Hangul" and "Han" scripts.

I'm also not sure whether Tibetan characters and scripts like Nushu (女书) should be treated as CJK characters (I'm not an expert in Unicode).
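To make the proposal concrete, here is a minimal sketch of a block-based check that covers only a few Han and Hangul blocks. The name `is_han_or_hangul` and the table `HAN_HANGUL_BLOCKS` are hypothetical and are not part of sacremoses; the block list is deliberately incomplete (later CJK extensions are omitted), and Tibetan and Nushu are left out only because that question is still open.

```python
# Hypothetical table of Unicode blocks as (first, last) code points, inclusive.
# This is an illustrative subset, not the list used by sacremoses.
HAN_HANGUL_BLOCKS = [
    (0x1100, 0x11FF),    # Hangul Jamo
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul Syllables
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def is_han_or_hangul(ch):
    """Return True if the single character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_HANGUL_BLOCKS)


# A Han character, a Hangul syllable, then hiragana and katakana.
for ch in "漢한あア":
    print(ch, is_han_or_hangul(ch))
```

With this table, kana fall outside every listed block, so they are not reported as Han or Hangul. Whether that is the desired behaviour for is_cjk is exactly the question raised here.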

alvations commented 5 years ago

@BLKSerene by kana, do you mean katakana? If so, it's surely CJK.

Also, it's following charIsCJK

The CJK checks should have already been following http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane from https://github.com/moses-smt/mosesdecoder/blob/8c5eaa1a122236bbf927bde4ec610906fea599e6/scripts/tokenizer/detokenizer.perl#L316
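As a rough illustration of why a range check based on the Basic Multilingual Plane layout ends up including kana, consider the sketch below. The single contiguous range U+2E80 to U+A4CF (CJK Radicals Supplement through Yi Radicals) is an assumption chosen for illustration, not copied from the Moses source, and both function names are hypothetical.

```python
def in_broad_bmp_range(ch):
    """Assumed broad check: one contiguous BMP range, U+2E80 to U+A4CF."""
    return 0x2E80 <= ord(ch) <= 0xA4CF


def is_kana(ch):
    """Hiragana (U+3040 to U+309F) or Katakana (U+30A0 to U+30FF)."""
    return 0x3040 <= ord(ch) <= 0x30FF


for ch in "漢あア한":
    print(ch, in_broad_bmp_range(ch), is_kana(ch))
```

Because the Hiragana and Katakana blocks sit inside that span, any check built on one broad range like this accepts kana along with Han characters. Hangul Syllables (U+AC00 to U+D7AF) lie outside it and would need their own range.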

If you see anything wrong with is_cjk(), feel free to patch it with a PR.

BLKSerene commented 5 years ago

@alvations Hi, I've created a pull request to update is_cjk, please take a look.

alvations commented 5 years ago

Closed via #14