Improve Japanese detection quality

greyblake commented 3 years ago

At the moment Japanese remains the only language that gives poor results even with long texts.

This seems to be due to the many Chinese characters used in Japanese text.

LANG	AVG	<= 20	21-50	51-100	> 100
Japanese	54.05%	52.94%	55.77%	55.55%	51.95%

See article https://eastasiastudent.net/regional/hanzi-and-kanji/

Chinese is written entirely in hanzi, and Japanese makes heavy use of Chinese characters.

The detection algorithm could probably be adjusted in the following way:

If the text contains only Mandarin characters => it's Chinese
If the text contains Mandarin characters and a large portion of Katakana or Hiragana (at least 25%) => it's Japanese
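
A minimal sketch of the rule above, not the actual whatlang-rs implementation. The names `Guess`, `is_han`, `is_kana`, and `guess_cjk` are hypothetical. It assumes that "Mandarin characters" means Han ideographs, that the 25% is measured against the combined count of Han and kana characters, and that text with some kana below the threshold is left undecided (the proposal does not say what to do in that case).

```rust
/// Hypothetical result type for this sketch (not whatlang's API).
#[derive(Debug, PartialEq)]
pub enum Guess {
    Chinese,
    Japanese,
    Unknown,
}

/// Han ideographs: CJK Unified Ideographs plus Extension A.
/// Other extension blocks are ignored to keep the sketch short.
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// Hiragana block (U+3040..U+309F) and Katakana block (U+30A0..U+30FF).
/// Note: the Katakana block also holds a few shared punctuation marks.
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}')
}

/// Applies the proposed heuristic. `kana_threshold` is a fraction,
/// e.g. 0.25 for the 25% suggested in the thread.
pub fn guess_cjk(text: &str, kana_threshold: f64) -> Guess {
    let han = text.chars().filter(|&c| is_han(c)).count();
    let kana = text.chars().filter(|&c| is_kana(c)).count();
    let total = han + kana;

    if total == 0 {
        // No Han or kana characters at all: this rule does not apply.
        return Guess::Unknown;
    }

    // Assumption: the ratio is kana / (Han + kana).
    let kana_ratio = kana as f64 / total as f64;

    if kana_ratio >= kana_threshold {
        Guess::Japanese
    } else if kana == 0 {
        Guess::Chinese
    } else {
        // Some kana, but below the threshold: left undecided here.
        Guess::Unknown
    }
}
```

With a threshold of 0.25, `guess_cjk("我喜欢学习中文", 0.25)` yields `Guess::Chinese` and `guess_cjk("私は日本語を勉強しています", 0.25)` yields `Guess::Japanese`. A kanji-heavy string such as an address falls below the threshold and stays `Unknown`, which is the case where a lower threshold would matter.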

KitaitiMakoto commented 3 years ago

I'm a native Japanese speaker. Feel free to mention me when you need help.

Your algorithm sounds good. 25% seems sufficient, and a lower threshold might be okay too.

greyblake commented 3 years ago

@KitaitiMakoto Thanks for the feedback! Yeah, I just wanted to double-check whether my idea makes sense. I am refactoring right now in order to implement and test that plan.

greyblake commented 3 years ago

@KitaitiMakoto This seems to be a big improvement for Japanese detection! I added benchmarks to https://github.com/greyblake/whatlang-rs/pull/89. Thank you!

KitaitiMakoto commented 3 years ago

Great!

greyblake / whatlang-rs

Improve Japanese detection quality #88