pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

Simplified & Traditional Chinese #192

Open reececomo opened 11 months ago

reececomo commented 11 months ago

We should treat Simplified Chinese and Traditional Chinese as two completely separate languages. There's obviously more nuance to it than that, but echoing #6 and #80, it's super important to distinguish between the two very different overarching types of Chinese.

It's close to saying "well both English and German look the same to me so they must be interoperable" 😆

A code of zh conventionally refers to Simplified Chinese.

# Common Simplified Chinese Codes
zh
zh-Hans
zh-CN (Mainland China variant - Historically used for all Simplified Chinese)

# Common Traditional Chinese Codes
zh-Hant
zh-TW (Taiwan variant - Historically used for all Traditional Chinese)
zh-HK (Hong Kong variant)

If it helps you from a training data point of view, they're two totally different ISO 15924 script codes (Hans vs Hant).
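
For illustration, here is a minimal sketch of how the codes listed above could be resolved to a script using only the JDK's `java.util.Locale`. The class `ChineseScriptTags` and its method `scriptOf` are hypothetical names and not part of Lingua. The fallback rules (bare `zh` and `zh-CN` map to `Hans`; `zh-TW` and `zh-HK` map to `Hant`) are assumptions taken from the conventions described in this comment, not an official mapping.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lingua's API.
class ChineseScriptTags {

    // Returns the script for a Chinese language tag: the explicit script
    // subtag if one is present, otherwise a guess based on the region.
    // Returns "" for tags whose language is not "zh".
    static String scriptOf(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        if (!"zh".equals(locale.getLanguage())) {
            return "";
        }
        // An explicit script subtag (zh-Hans, zh-Hant) always wins.
        String script = locale.getScript();
        if (!script.isEmpty()) {
            return script;
        }
        // Assumption: region-only tags follow the conventions listed above.
        switch (locale.getCountry()) {
            case "TW":
            case "HK":
                return "Hant";
            default:
                // Bare "zh" and "zh-CN" are treated as Simplified.
                return "Hans";
        }
    }

    public static void main(String[] args) {
        String[] tags = {"zh", "zh-Hans", "zh-CN", "zh-Hant", "zh-TW", "zh-HK"};
        for (String tag : tags) {
            System.out.println(tag + " -> " + scriptOf(tag));
        }
    }
}
```

A detector that reported `Hans` and `Hant` separately could use a lookup like this to map caller-supplied locale codes onto the two variants.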

pemistahl commented 11 months ago

Hello Reece,

thank you for this clarification. I wasn't aware of the fact that simplified and traditional Chinese are as different as English and German, for instance. I will try to find better training data for each variant and let the library handle the variants separately.

jibaro commented 7 months ago

Hello @pemistahl, the difference between simplified and traditional characters is very important for Chinese. When can you support it? -_-!