Open miiton opened 2 years ago
@miiton Hi, it's not the first time this issue has been raised regarding Japanese vs. Chinese (Mandarin). It was improved to some extent in https://github.com/greyblake/whatlang-rs/pull/45
I'll be honest with you: I have very little knowledge of the Chinese and Japanese languages, and I would not be able to develop good heuristics to distinguish them.
I will take a look at the link you added to see if it helps. On your side: if you know Japanese and Chinese, you can contribute by providing a bigger set of examples, which we could use for unit tests, in the following format:
Thank you.
Thank you for your reply.
I'm not familiar with Chinese at all, but I'll give it some thought.
As a speaker of Japanese but not Chinese, one thing about the strings here is that in the ones that should be Chinese, there are simplified Chinese characters that are never used in normal Japanese, like 东, 际, or 员, and even not knowing Chinese I can recognize them immediately. Unfortunately it looks like there's no Unicode property like "this is a simplified character" (there's something that looks similar but isn't useful for this purpose).
One simple way to make a list of these would be to check whether the characters are present in older encodings like Shift-JIS / EUC-JP (Japanese) or GB 2312 (Simplified Chinese). You could also take a big chunk of text in each language (like Wikipedia) and pick a cutoff for character occurrence.
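A rough sketch of the encoding check, written in Python only because its standard library ships `shift_jis` and `gb2312` codecs. The names `encodable`, `is_han`, and `script_hints` are made up for illustration; none of this is whatlang code, and it only looks at the main CJK Unified Ideographs block.

```python
def encodable(ch: str, encoding: str) -> bool:
    """True if `ch` can be represented in the given legacy encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def is_han(ch: str) -> bool:
    # Main CJK Unified Ideographs block only; extension blocks are ignored
    # (a simplifying assumption for this sketch).
    return "\u4e00" <= ch <= "\u9fff"


def script_hints(text: str) -> tuple[int, int]:
    """Count Han characters that exist only in Shift-JIS or only in GB 2312.

    Returns (jp_only, zh_only). Characters present in both encodings, or in
    neither, carry no signal and are not counted.
    """
    jp_only = zh_only = 0
    for ch in text:
        if not is_han(ch):
            continue
        in_jp = encodable(ch, "shift_jis")
        in_zh = encodable(ch, "gb2312")
        if in_jp and not in_zh:
            jp_only += 1
        elif in_zh and not in_jp:
            zh_only += 1
    return jp_only, zh_only
```

A detector could treat a non-zero `zh_only` count as evidence for Simplified Chinese and a non-zero `jp_only` count as evidence for Japanese. Strings made up entirely of shared characters stay ambiguous.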
As @polm wrote, one option is to check whether a character is included in Shift-JIS or EUC-JP. That looks accurate if you focus only on Japanese, but when I checked, too many of those characters are also used in Traditional Chinese, so I expect this approach would increase the share of Chinese text that is misidentified as Japanese.
I then restricted the check to "Joyo kanji", which seemed to give pretty good results, so I'm trying that out.
If it looks fine, I'll create a PR.
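For anyone following along, here is a minimal sketch of what a Joyo-based check could look like. This is a guess at the general idea, not the code actually being tested here. `JOYO_SAMPLE` is a tiny hand-picked subset for illustration (the full Joyo list has 2,136 characters), and `joyo_ratio` is a hypothetical helper name.

```python
# Illustrative subset only; a real check would load the full Joyo kanji list.
JOYO_SAMPLE = frozenset("駅気円働込国会学")


def is_han(ch: str) -> bool:
    # Main CJK Unified Ideographs block only (simplifying assumption).
    return "\u4e00" <= ch <= "\u9fff"


def joyo_ratio(text: str, joyo: frozenset = JOYO_SAMPLE):
    """Fraction of Han characters in `text` that appear in the Joyo set.

    Returns None when the text contains no Han characters at all.
    """
    han = [ch for ch in text if is_han(ch)]
    if not han:
        return None
    return sum(ch in joyo for ch in han) / len(han)
```

A high ratio would hint at Japanese, but many Joyo kanji are also everyday Chinese characters, so the ratio alone cannot settle the question.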
After various investigations, I have almost given up. The reason is shown in the image posted at the link below: Chinese hanzi and Japanese kanji overlap so much that I think handling this reliably is unrealistic.
Unless someone else comes up with a very good idea, I think you can close this issue for now.
https://github.com/meilisearch/product/discussions/532#discussioncomment-3705382
This is not a big problem, since a slightly longer Japanese text is likely to contain kana, which helps distinguish Japanese from Chinese. But it's still incorrect to classify kanji that are unmistakably used only in Japanese as Chinese. 😥
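The kana shortcut is easy to sketch. This Python snippet assumes that any character from the Unicode Hiragana (U+3040 to U+309F) or Katakana (U+30A0 to U+30FF) blocks is enough to call the text Japanese; `has_kana` is a hypothetical helper, not part of whatlang.

```python
def has_kana(text: str) -> bool:
    """True if `text` contains any character from the Hiragana or Katakana blocks."""
    return any(
        "\u3040" <= ch <= "\u309f"      # Hiragana block
        or "\u30a0" <= ch <= "\u30ff"   # Katakana block
        for ch in text
    )
```

Kanji-only strings return `False` here, which is exactly the case this issue is about.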
@OuOu2021, that depends on your needs. As a search engine developer, I have to detect the language of a short string such as a search query.
I'm from the Meilisearch community. ( related: meilisearch/meilisearch/issues/2403 )
Wouldn't it be possible to distinguish between Simplified Chinese (Mandarin) and Japanese when a string consists only of Hanzi/Kanji?
For example
whatlang v0.16.0 detects...
My expected result is...
References