Closed greyblake closed 3 years ago
I'm a native Japanese speaker. Feel free to mention me when you need help.
Your algorithm sounds good. 25% seems like enough, and a lower threshold might be okay too.
@KitaitiMakoto Thanks for the feedback! Yeah, I just wanted to double-check that my idea makes sense. I am refactoring right now so that I can implement and test that plan.
@KitaitiMakoto This seems to be a big improvement for Japanese detection! I added benchmarks to https://github.com/greyblake/whatlang-rs/pull/89. Thank you!
Great!
At the moment Japanese remains the only language that gives poor results even with long texts.
This seems to be because Japanese text contains many Chinese characters (kanji).
See article https://eastasiastudent.net/regional/hanzi-and-kanji/
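To make the problem concrete, here is a minimal sketch (not whatlang's actual code) that measures how much of a text's CJK content is kana rather than Chinese characters. The function names `kana_share` and `looks_japanese` are hypothetical, the Unicode ranges cover only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks, and using the share as a threshold check is an assumption made for illustration.

```rust
/// Share of kana among all kana + Han characters in `text`.
/// Returns `None` when the text contains neither.
/// Hypothetical helper: only the basic Unicode blocks are counted.
fn kana_share(text: &str) -> Option<f64> {
    let mut kana = 0usize;
    let mut han = 0usize;
    for ch in text.chars() {
        match ch {
            // Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF)
            '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}' => kana += 1,
            // CJK Unified Ideographs, shared by Chinese and Japanese
            '\u{4E00}'..='\u{9FFF}' => han += 1,
            _ => {}
        }
    }
    let total = kana + han;
    if total == 0 {
        None
    } else {
        Some(kana as f64 / total as f64)
    }
}

/// Assumed heuristic: treat the text as Japanese when the kana share
/// reaches `threshold` (for example 0.25).
fn looks_japanese(text: &str, threshold: f64) -> bool {
    kana_share(text).map_or(false, |share| share >= threshold)
}
```

With a threshold of 0.25, `looks_japanese("私は日本語を勉強しています", 0.25)` is true because kana make up 7 of the 13 counted characters, while a purely Chinese sentence such as `"我正在学习中文"` has a kana share of 0 and is rejected.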
The detection algorithm could probably be adjusted in the following way: