Closed 71sprite closed 1 month ago
Hi @71sprite, thanks for your request.
I'm aware of the difficulty of recognizing Chinese and Japanese correctly. These are actually the most difficult languages. I will try to improve the algorithm, but as I'm not a speaker of these languages, it's not easy. If you speak these languages and have ideas for heuristics to implement, I will be glad to read about them.
I have also read some documents, such as List_of_Unicode_characters. It is indeed impossible to accurately distinguish among Chinese, Japanese and Korean. Perhaps we can judge according to the Unicode range.
func isChinese(c rune) bool {
	// Chinese Unicode range
	return (c >= '\u3400' && c <= '\u4db5') || // CJK Unified Ideographs Extension A
		(c >= '\u4e00' && c <= '\u9fed') || // CJK Unified Ideographs
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs
}
func isJapanese(c rune) bool {
	// Japanese Unicode range
	return (c >= '\u3021' && c <= '\u3029') || // Hangzhou numerals (in CJK Symbols and Punctuation)
		(c >= '\u3040' && c <= '\u309f') || // Hiragana
		(c >= '\u30a0' && c <= '\u30ff') || // Katakana
		(c >= '\u31f0' && c <= '\u31ff') || // Katakana Phonetic Extensions
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs (also matched by isChinese)
}
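For comparison, here is a minimal sketch of the same range idea using the range tables from Go's standard `unicode` package instead of hand-written bounds. The `script` type and the `scriptOf` function are hypothetical names for illustration, not part of the library under discussion. Han ideographs are shared by Chinese and Japanese, so this sketch treats only kana as Japanese-specific and reports ideographs as ambiguous.

```go
package main

import (
	"fmt"
	"unicode"
)

// script is a hypothetical label for the writing system of a single rune.
type script int

const (
	other script = iota // anything that is neither Han nor kana
	han                 // ideographs shared by Chinese and Japanese
	kana                // Hiragana or Katakana, specific to Japanese
)

// scriptOf classifies one rune using the standard library's script tables.
func scriptOf(r rune) script {
	switch {
	case unicode.In(r, unicode.Hiragana, unicode.Katakana):
		return kana
	case unicode.Is(unicode.Han, r):
		return han
	default:
		return other
	}
}

func main() {
	// Print the numeric label for a Han, a Hiragana, a Katakana and a Latin rune.
	for _, r := range "中あアa" {
		fmt.Printf("%q -> %d\n", r, scriptOf(r))
	}
}
```

Using the script tables avoids hard-coding block boundaries that shift as new Unicode versions assign more code points.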
As a speaker of Chinese and Japanese, I vote for @71sprite
Closed in favor of #68. The Rust implementation will contain improvements for distinguishing Chinese and Japanese in the next version, 1.7.0, to be released later this year.
To reproduce:
Expected: Get Chinese for this case. This happens because the code here returns Japanese if any japaneseCharacterSet char exists. I'm unsure if this is intended. Thanks for the awesome work!
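To make the question concrete, the following is a rough sketch of one alternative to an any-match rule: count kana and Han runes, and report Japanese only when kana make up at least a given share of them. The function name `guessCJK`, the 10% threshold, and the sample strings are all assumptions for illustration. This is not how the library decides, and it would still label Japanese text written entirely in kanji as Chinese.

```go
package main

import (
	"fmt"
	"unicode"
)

// guessCJK is a hypothetical helper. It returns "Japanese" when kana account
// for at least minKanaShare of all kana and Han runes in text, "Chinese" when
// they do not, and "Unknown" when text contains neither kind of rune.
func guessCJK(text string, minKanaShare float64) string {
	var kanaCount, hanCount int
	for _, r := range text {
		switch {
		case unicode.In(r, unicode.Hiragana, unicode.Katakana):
			kanaCount++
		case unicode.Is(unicode.Han, r):
			hanCount++
		}
	}
	total := kanaCount + hanCount
	if total == 0 {
		return "Unknown"
	}
	if float64(kanaCount)/float64(total) >= minKanaShare {
		return "Japanese"
	}
	return "Chinese"
}

func main() {
	// Mostly Chinese prose that contains a single kana character.
	fmt.Println(guessCJK("这是一个很长的中文句子，里面只有一个假名の", 0.10))
	// Ordinary Japanese prose mixing kana and kanji.
	fmt.Println(guessCJK("これは日本語の文です", 0.10))
}
```

The threshold is a tunable guess; a real implementation would likely need to validate it against labeled text.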