pemistahl / lingua-rs

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text
Apache License 2.0

Add Kanji support #152

Open OuOu2021 opened 1 year ago

OuOu2021 commented 1 year ago

Before we start, I would like to clarify some terms. Kanji are the characters of Chinese origin used in Japanese writing. In this issue I use "Chinese characters" as a collective name for Simplified Chinese characters, Traditional Chinese characters and Kanji.

It seems that Lingua identifies all Chinese characters as Chinese with a confidence value of 100 percent, which is not right. Some Kanji words are written exactly the same way in Chinese (such as 豆腐 (tofu) and 科学 (science)), while some Kanji are used in neither Simplified nor Traditional Chinese. For example, "economy" is written as "经济" in Simplified Chinese, "經濟" in Traditional Chinese and "経済" in Kanji, but Lingua 1.4 determines all three to be Chinese with 100% confidence.

This is not a big problem, since a slightly longer text such as a tweet in Japanese is likely to contain kana, which helps Lingua distinguish it. Still, it is incorrect to classify Kanji that are unmistakably used only in Japanese as 100% Chinese, so I have to point it out.

Also see greyblake/whatlang-rs/issues/122

OuOu2021 commented 1 year ago
経済: (Chinese, 1.0)
和製漢字: (Chinese, 1.0)
雫: (Chinese, 1.0)
労働: (Chinese, 1.0)
峠: (Chinese, 1.0)
勉強中: (Chinese, 1.0)
自動販売機: (Chinese, 1.0)

They are all 100% Japanese words.

pemistahl commented 1 year ago

Hi @OuOu2021, thank you for reaching out to me. You can probably imagine how difficult it is to solve this problem. The language models I use for Chinese and Japanese are obviously insufficient for words such as your examples. Perhaps it helps to determine which characters are really unique to Chinese or Japanese and to extend the language models with this information. I will try to improve the library in this regard but it may take significant time as the todo list is pretty long already.
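A minimal sketch of the idea suggested above: keep a table of characters that occur in Japanese but not in standard Simplified or Traditional Chinese, and check the input against it. The names `JAPANESE_ONLY_SAMPLE` and `contains_japanese_only_char` are hypothetical, the character list is a small hand-picked sample (shinjitai and kokuji) rather than a complete or authoritative table, and none of this is part of Lingua's API.

```rust
/// Hypothetical, hand-picked sample of characters used in Japanese but not in
/// standard Simplified or Traditional Chinese text. A real table would need
/// to be far larger and derived from a reliable source.
const JAPANESE_ONLY_SAMPLE: &[char] = &[
    '経', '済', '労', '売', // shinjitai (Japanese simplified forms)
    '働', '峠', '畑', '辻', '込', // kokuji (characters coined in Japan)
];

/// Returns true if the text contains at least one character from the sample
/// table, which hints that the text is Japanese rather than Chinese.
fn contains_japanese_only_char(text: &str) -> bool {
    text.chars().any(|c| JAPANESE_ONLY_SAMPLE.contains(&c))
}
```

With this sample table, 経済 and 自動販売機 are flagged, while 经济, 經濟 and 勉強中 are not. The last case shows that a character table alone cannot catch Japanese words built entirely from characters shared with Chinese.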

RoDmitry commented 1 month ago

It looks like the Chinese model was trained on Traditional Chinese and does not handle Simplified Chinese well enough. The Chinese model also appears to be very slow, so there is a hack that treats any Han character found as "Chinese" unless Japanese characters are also present. But if you disable the crate feature = "chinese", any Han character will be considered Japanese.
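The behavior described in the comment above can be sketched roughly as follows. This is an illustration of the described rule only, not Lingua's actual source code: the `Guess` enum, the helper functions and the `chinese_enabled` flag (standing in for the `chinese` crate feature) are all hypothetical, and the Unicode ranges cover only the Hiragana and Katakana blocks and the two most common CJK ideograph blocks.

```rust
#[derive(Debug, PartialEq)]
enum Guess {
    Chinese,
    Japanese,
    Unknown,
}

/// True for code points in the Hiragana (U+3040..U+309F) or
/// Katakana (U+30A0..U+30FF) Unicode blocks.
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}')
}

/// True for code points in CJK Unified Ideographs (U+4E00..U+9FFF)
/// or CJK Extension A (U+3400..U+4DBF).
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// Script-only shortcut as described in the comment: kana means Japanese;
/// otherwise any Han character means Chinese when Chinese is enabled,
/// and Japanese when it is not.
fn guess_by_script(text: &str, chinese_enabled: bool) -> Guess {
    if text.chars().any(is_kana) {
        return Guess::Japanese;
    }
    if text.chars().any(is_han) {
        return if chinese_enabled {
            Guess::Chinese
        } else {
            Guess::Japanese
        };
    }
    Guess::Unknown
}
```

Under this rule, a kana-free word such as 経済 is labeled Chinese whenever Chinese is enabled, which matches the results reported earlier in this issue.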