greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
MIT License
965 stars 108 forks

Can distinguish between Simplified Chinese and Japanese Kanji? #122

Open miiton opened 2 years ago

miiton commented 2 years ago

I'm from the Meilisearch community. ( related: meilisearch/meilisearch/issues/2403 )

Would it be possible to distinguish between Simplified Chinese (Mandarin) and Kanji (Japanese) when a string consists only of Hanzi/Kanji?

For example

whatlang v0.16.0 detects...

Word Cmn Jpn Meaning in English
東京 "Tokyo" in Kanji
东京 "Tokyo" in Simplified Chinese
大阪 "Osaka" in both Kanji and Simplified Chinese
会員 "member, customer" in Kanji
会员 "member, customer" in Simplified Chinese
関西国際空港 "Kansai International Airport" in Kanji
関西国际空港 "Kansai International Airport" in Simplified Chinese

My expected result is...

Word Cmn Jpn Meaning in English
東京 "Tokyo" in Kanji
东京 "Tokyo" in Simplified Chinese
大阪 "Osaka" in both Kanji and Simplified Chinese
会員 "member, customer" in Kanji
会员 "member, customer" in Simplified Chinese
関西国際空港 "Kansai International Airport" in Kanji
関西国际空港 "Kansai International Airport" in Simplified Chinese

References

greyblake commented 2 years ago

@miiton Hi, this isn't the first time the Japanese vs. Chinese (Mandarin) issue has been raised. It was improved to some extent in https://github.com/greyblake/whatlang-rs/pull/45

I'll be honest with you: I have very little knowledge of Chinese and Japanese, and I would not be able to develop good heuristics to distinguish them.

I will take a look at the link you added to see if it helps. On your side: if you know Japanese and Chinese, you can contribute by providing a bigger set of examples, which we could use for unit tests, in the following format:

Thank you.

miiton commented 2 years ago

Thank you for your reply.

I'm not familiar with Chinese at all, but I'll give it some thought.

polm commented 2 years ago

As a speaker of Japanese but not Chinese, one thing about the strings here is that in the ones that should be Chinese, there are simplified Chinese characters that are never used in normal Japanese, like 东, 际, or 员, and even not knowing Chinese I can recognize them immediately. Unfortunately it looks like there's no Unicode property like "this is a simplified character" (there's something that looks similar but isn't useful for this purpose).
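To make that observation concrete, here is a minimal sketch in Rust. It assumes a hand-maintained set of marker characters; the three below are just the ones named above, and `SIMPLIFIED_ONLY` and `has_simplified_only_char` are hypothetical names, not part of whatlang. A usable list would need to be far larger.

```rust
/// Illustrative only: simplified Chinese characters named in this thread
/// that are not used in normal Japanese. A real list would be much longer.
const SIMPLIFIED_ONLY: [char; 3] = ['东', '际', '员'];

/// Returns true if `text` contains at least one character from the marker set.
/// A hit suggests Chinese; a miss says nothing either way.
fn has_simplified_only_char(text: &str) -> bool {
    text.chars().any(|c| SIMPLIFIED_ONLY.contains(&c))
}
```

With this set, `has_simplified_only_char("东京")` is true and `has_simplified_only_char("東京")` is false. A word like 大阪 contains no marker and stays ambiguous.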

One simple way to make a list of these would be to check if the characters are present in older encodings like Shift-JIS / EUC-JP (Japanese) or GB 2312 (Simplified Chinese). You could also take a big chunk of each language (like Wikipedia) and make some cutoff for character occurrence.
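The corpus idea could be sketched roughly as follows, using only the Rust standard library. The function names, the `min_count` cutoff, and the rule "never appears in the other corpus" are all assumptions made for illustration; real corpora are noisy (Japanese Wikipedia quotes Chinese text, for example), so a strict zero on the other side is probably too harsh in practice.

```rust
use std::collections::{HashMap, HashSet};

/// Counts how often each non-whitespace character occurs in `corpus`.
fn char_counts(corpus: &str) -> HashMap<char, usize> {
    let mut counts = HashMap::new();
    for c in corpus.chars().filter(|c| !c.is_whitespace()) {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
}

/// Characters seen at least `min_count` times in `ours` and never in `theirs`.
/// Both the cutoff and the "never in the other corpus" rule are assumptions.
fn distinctive_chars(ours: &str, theirs: &str, min_count: usize) -> HashSet<char> {
    let other = char_counts(theirs);
    char_counts(ours)
        .into_iter()
        .filter(|(c, n)| *n >= min_count && !other.contains_key(c))
        .map(|(c, _)| c)
        .collect()
}
```

Calling `distinctive_chars(chinese_text, japanese_text, cutoff)` gives candidate Chinese markers, and swapping the first two arguments gives candidate Japanese markers. Characters shared by both corpora, such as 京 or 阪, drop out of both sets.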

miiton commented 2 years ago

As @polm wrote, one option is to check whether a character is included in Shift-JIS or EUC-JP. I think that works well for accuracy on the Japanese side, but when I checked, too many of those characters overlap with traditional Chinese characters, so the share of Chinese text misidentified as Japanese would likely increase.

So I focused on "Joyo kanji" instead. I think I got a pretty good result, so I'm trying it out.

If it looks fine, I'll create a PR.

miiton commented 1 year ago

After various investigations, I have almost given up. The reason is shown in the image posted at the link below: Chinese and Japanese kanji overlap too much, so I think handling this is unrealistic.

Unless someone else comes up with a very good idea, I think you can close this issue for now.

https://github.com/meilisearch/product/discussions/532#discussioncomment-3705382

OuOu2021 commented 1 year ago

This is not a big problem, since a slightly longer Japanese text is likely to contain kana, which helps distinguish Japanese from Chinese. But it's still incorrect to classify Kanji that are unambiguously used only in Japanese as Chinese. 😥
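For reference, the kana check mentioned here is cheap to express. The sketch below tests for the Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks; `is_kana` and `contains_kana` are hypothetical helper names, not whatlang functions, and half-width katakana is deliberately left out.

```rust
/// True if `c` is in the Unicode Hiragana (U+3040..=U+309F) or
/// Katakana (U+30A0..=U+30FF) block. Half-width katakana is not covered.
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}')
}

/// True if `text` contains at least one kana character,
/// which is strong evidence that the text is Japanese.
fn contains_kana(text: &str) -> bool {
    text.chars().any(is_kana)
}
```

`contains_kana("東京に行く")` is true because of に and く, while `contains_kana("東京")` and `contains_kana("东京")` are both false, which is exactly the short-string case this issue is about.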

ManyTheFish commented 1 year ago

@OuOu2021, it depends on your needs. As a search engine developer, I have to detect the language of a small string like a search query.