Strange results for Chinese with Japanese

pemistahl / lingua-go

The most accurate natural language detection library for Go, suitable for short text and mixed-language text

Apache License 2.0


Strange results for Chinese with Japanese #38

Closed 71sprite closed 1 month ago

71sprite commented 1 year ago

To reproduce:

package main

import (
    "fmt"

    "github.com/pemistahl/lingua-go"
)

func main() {
    // Build a detector over all languages supported by lingua-go.
    detector := lingua.NewLanguageDetectorBuilder().
        FromAllLanguages().
        Build()

    // Mostly Chinese text followed by a short Japanese interjection.
    text := "上海大学是一个好大学. わー!"
    if language, exists := detector.DetectLanguageOf(text); exists {
        fmt.Println(language.String()) // prints "Japanese"
    }
}

Expected: the detector returns Chinese for this input.

https://github.com/pemistahl/lingua-go/blob/main/detector.go#L467

This happens because the code at that line returns Japanese if any character from japaneseCharacterSet is present. I'm unsure whether this is intended.

Thanks for the awesome work!
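To make the behavior described above concrete, here is a minimal sketch of an "any kana wins" rule. It is not the code in detector.go: the function name anyKanaMeansJapanese is hypothetical, and it relies only on the script tables in Go's standard unicode package. It merely shows why one Hiragana character is enough to flip a mostly Chinese sentence to Japanese under such a rule.

```go
package main

import (
	"fmt"
	"unicode"
)

// anyKanaMeansJapanese is a hypothetical stand-in for the rule described
// above: the first Hiragana or Katakana character decides for Japanese,
// no matter how many Han characters surround it.
func anyKanaMeansJapanese(text string) string {
	hasHan := false
	for _, r := range text {
		if unicode.In(r, unicode.Hiragana, unicode.Katakana) {
			return "Japanese"
		}
		if unicode.Is(unicode.Han, r) {
			hasHan = true
		}
	}
	if hasHan {
		return "Chinese"
	}
	return "Unknown"
}

func main() {
	// The single kana character in the interjection decides the result.
	fmt.Println(anyKanaMeansJapanese("上海大学是一个好大学. わー!")) // Japanese
	// Without the interjection, only Han characters remain.
	fmt.Println(anyKanaMeansJapanese("上海大学是一个好大学.")) // Chinese
}
```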

pemistahl commented 1 year ago

Hi @71sprite, thanks for your request.

I'm aware of the difficulties in recognizing Chinese and Japanese correctly. These are actually the most difficult languages to detect. I will try to improve the algorithm, but as I'm not a speaker of these languages, it's not easy. If you speak these languages and have ideas for heuristics to implement, I will be glad to read about them.

71sprite commented 1 year ago

I have also read some documentation (List_of_Unicode_characters). It is indeed impossible to accurately distinguish among Chinese, Japanese and Korean. Perhaps we can judge according to the Unicode range.

// isChinese reports whether c falls into one of the ideograph ranges below.
func isChinese(c rune) bool {
    return (c >= '\u3400' && c <= '\u4db5') || // CJK Unified Ideographs Extension A
        (c >= '\u4e00' && c <= '\u9fed') || // CJK Unified Ideographs
        (c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs
}

// isJapanese reports whether c falls into one of the ranges below.
func isJapanese(c rune) bool {
    return (c >= '\u3021' && c <= '\u3029') || // Hangzhou numerals (CJK Symbols and Punctuation)
        (c >= '\u3040' && c <= '\u309f') || // Hiragana
        (c >= '\u30a0' && c <= '\u30ff') || // Katakana
        (c >= '\u31f0' && c <= '\u31ff') || // Katakana Phonetic Extensions
        (c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs (also matched by isChinese)
}
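
One way the range checks above could feed a decision is to count characters per range instead of stopping at the first kana. The following is only a sketch under stated assumptions: isHan, isKana, guessByKanaShare and the 0.25 threshold are hypothetical choices made for illustration, not part of lingua-go, and the ranges are whole Unicode blocks. Japanese text written without any kana would still be reported as Chinese by this rule.

```go
package main

import "fmt"

// isHan reports whether c lies in the CJK Unified Ideographs block or its
// Extension A block. Chinese and Japanese share these ideographs.
func isHan(c rune) bool {
	return (c >= 0x3400 && c <= 0x4DBF) || (c >= 0x4E00 && c <= 0x9FFF)
}

// isKana reports whether c lies in the Hiragana, Katakana, or Katakana
// Phonetic Extensions block.
func isKana(c rune) bool {
	return (c >= 0x3040 && c <= 0x30FF) || (c >= 0x31F0 && c <= 0x31FF)
}

// guessByKanaShare is a hypothetical heuristic: it returns "Japanese" only
// if kana make up at least minKanaShare of all Han and kana characters.
func guessByKanaShare(text string, minKanaShare float64) string {
	var han, kana int
	for _, c := range text {
		switch {
		case isKana(c):
			kana++
		case isHan(c):
			han++
		}
	}
	total := han + kana
	if total == 0 {
		return "Unknown"
	}
	if float64(kana)/float64(total) >= minKanaShare {
		return "Japanese"
	}
	return "Chinese"
}

func main() {
	// 10 Han characters and 2 kana: the kana share is below the threshold.
	fmt.Println(guessByKanaShare("上海大学是一个好大学. わー!", 0.25)) // Chinese
	// 5 Han characters and 6 kana: the kana share is above the threshold.
	fmt.Println(guessByKanaShare("これは日本語の文章です。", 0.25)) // Japanese
}
```

The threshold is arbitrary and would need tuning against real text before being used for anything beyond experimentation.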

lyricat commented 1 year ago

As a speaker of Chinese and Japanese, I vote for @71sprite's suggestion.

pemistahl commented 1 month ago

Closed in favor of #68. The next version of the Rust implementation, 1.7.0, will contain improvements for distinguishing Chinese from Japanese and is due to be released later this year.