Closed BLKSerene closed 5 years ago
@BLKSerene By kanas, do you mean katakana? If so, it's surely CJK.
Also, is_cjk() follows charIsCJK from the Moses detokenizer (https://github.com/moses-smt/mosesdecoder/blob/8c5eaa1a122236bbf927bde4ec610906fea599e6/scripts/tokenizer/detokenizer.perl#L316), whose CJK checks should already follow http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane
If you see anything wrong with is_cjk(), feel free to patch it with a PR.
@alvations Hi, I've created a pull request to update is_cjk, please take a look.
Closed via #14
Currently sacremoses.util.is_cjk treats Japanese kana as CJK characters, which I suppose should be excluded. Maybe it is better to use https://en.wikipedia.org/wiki/Unicode_block as the reference instead of https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane (given in the docstring) and enumerate the Unicode code points of all characters under the "Hangul" and "Han" scripts.
Also, I'm not sure whether Tibetan characters and scripts like Nushu (女书) should be treated as CJK characters (I'm not an expert in Unicode).
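
A minimal sketch of the block-based approach suggested above, assuming a hypothetical helper named is_han_or_hangul. This is not the sacremoses implementation, and the range list is illustrative rather than exhaustive: it covers only a handful of well-known Han and Hangul blocks.

```python
# Illustrative, non-exhaustive Unicode block ranges (inclusive) for the
# "Han" and "Hangul" scripts. HAN_HANGUL_RANGES and is_han_or_hangul are
# hypothetical names for this sketch, not part of sacremoses.
HAN_HANGUL_RANGES = [
    (0x1100, 0x11FF),    # Hangul Jamo
    (0x3130, 0x318F),    # Hangul Compatibility Jamo
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul Syllables
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def is_han_or_hangul(character):
    """Return True if the single character falls in one of the listed blocks."""
    code_point = ord(character)
    return any(start <= code_point <= end for start, end in HAN_HANGUL_RANGES)
```

With this list, kana (Hiragana at U+3040 to U+309F, Katakana at U+30A0 to U+30FF) fall outside every range, so is_han_or_hangul("カ") is False while is_han_or_hangul("女") and is_han_or_hangul("한") are True. Tibetan and Nushu are likewise excluded simply because their blocks are not listed; whether they belong is the open question above.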