greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
MIT License

Improve Japanese detection quality #88

Closed greyblake closed 3 years ago

greyblake commented 3 years ago

At the moment Japanese remains the only language that gives poor results even with long texts.

This seems to be because Japanese text contains many Chinese characters.

| LANG | AVG | <= 20 | 21-50 | 51-100 | > 100 |
|---|---|---|---|---|---|
| Japanese | 54.05% | 52.94% | 55.77% | 55.55% | 51.95% |

See article https://eastasiastudent.net/regional/hanzi-and-kanji/

Chinese is written entirely in hanzi, and Japanese makes heavy use of Chinese characters.

The detection algorithm could probably be adjusted in the following way:

KitaitiMakoto commented 3 years ago

I'm a native Japanese speaker. Feel free to mention me when you need help.

Your algorithm sounds good. 25% seems enough, and a lower threshold might be okay too.
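The proposed adjustment itself is not reproduced in this thread. As a rough illustration only, assuming the 25% figure refers to the share of kana (Hiragana and Katakana) among kana and Chinese characters, a sketch in Rust might look like the following. The names `is_kana`, `is_han`, `kana_ratio`, `guess`, and `Guess` are hypothetical and are not part of the whatlang API, and the Han check covers only the basic CJK Unified Ideographs block.

```rust
/// Hypothetical helper: true for Hiragana (U+3040..=U+309F)
/// and Katakana (U+30A0..=U+30FF).
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}')
}

/// Hypothetical helper: true for the basic CJK Unified Ideographs
/// block only (U+4E00..=U+9FFF). Extension blocks are ignored here.
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}')
}

/// Share of kana among all kana + Han characters in `text`.
/// Returns `None` when the text contains neither.
fn kana_ratio(text: &str) -> Option<f64> {
    let (mut kana, mut han) = (0usize, 0usize);
    for c in text.chars() {
        if is_kana(c) {
            kana += 1;
        } else if is_han(c) {
            han += 1;
        }
    }
    let total = kana + han;
    if total == 0 {
        None
    } else {
        Some(kana as f64 / total as f64)
    }
}

#[derive(Debug, PartialEq)]
enum Guess {
    Japanese,
    Chinese,
    Unknown,
}

/// Assumed rule: enough kana means Japanese, otherwise Chinese.
/// `threshold` is a fraction, e.g. 0.25 for the 25% mentioned above.
fn guess(text: &str, threshold: f64) -> Guess {
    match kana_ratio(text) {
        Some(ratio) if ratio >= threshold => Guess::Japanese,
        Some(_) => Guess::Chinese,
        None => Guess::Unknown,
    }
}
```

With this sketch, `guess("私は学生です", 0.25)` counts three kana and three Han characters and returns `Guess::Japanese`, while `guess("我是学生", 0.25)` finds no kana and returns `Guess::Chinese`. Keeping the threshold as a parameter makes it easy to try values below 25%.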

greyblake commented 3 years ago

@KitaitiMakoto Thanks for the feedback! Yeah, I just wanted to double-check that my idea makes sense. I am refactoring right now in order to implement and test that plan.

greyblake commented 3 years ago

@KitaitiMakoto This seems to be a big improvement for Japanese detection! I added benchmarks in https://github.com/greyblake/whatlang-rs/pull/89 . Thank you!

KitaitiMakoto commented 3 years ago

Great!