Cantonese language support

shuyo / language-detection

This is a language detection library implemented in plain Java. (aliases: language identification, language guessing)

I tried 40+ languages. Using lang-detector, I am able to detect languages with almost 100% accuracy. But if I try Cantonese (e.g. 十六岁的中学生托特发现，纳粹战犯曲德也一直平静地住在当地小镇上，托特对曲德战争时所犯下的罪行深感兴趣，决定去勒索曲德，为了托特不告发他，曲德要向托特透露他在过去战时所犯过的罪行，两人奇特的关系不久便告失控，影片的结局是令人震惊), the result is Korean, Chinese (Traditional), or Chinese (Simplified). Is there any way to detect Cantonese?

Not by character set alone: Cantonese is written with the same Chinese characters. You need statistical or grammatical analysis, and statistical would be better. The idea is to detect the characters whose frequency spikes in Cantonese [iso:yue] texts compared to standard Mandarin.
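To make the "frequency spike" idea concrete, here is a minimal Python sketch. It assumes you already have one sample of Cantonese text and one of Mandarin text as plain strings. The function names, the `floor` smoothing value, and the log-ratio score are my own illustrative choices, not part of language-detection.

```python
from collections import Counter
import math


def relative_frequencies(text):
    """Map each non-whitespace character to its share of the text."""
    chars = [c for c in text if not c.isspace()]
    total = len(chars)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(chars).items()}


def frequency_spikes(yue_text, cmn_text, floor=1e-6, top=10):
    """Rank characters by log(Cantonese share / Mandarin share).

    `floor` (an assumed smoothing value) stands in for characters that
    never occur in the Mandarin sample, so the ratio stays finite.
    """
    yue = relative_frequencies(yue_text)
    cmn = relative_frequencies(cmn_text)
    scored = [
        (math.log(share / max(cmn.get(c, 0.0), floor)), c)
        for c, share in yue.items()
    ]
    scored.sort(reverse=True)  # largest spike first
    return [c for _, c in scored[:top]]
```

Punctuation is counted like any other character here, so on real corpora you would probably want to filter it out before ranking.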

For Mandarin character frequencies, use Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE, 5(6), e10729.
For Cantonese character frequencies, I don't know of an existing frequency list. One could take a Cantonese corpus and do some simple character counting to create one:
```
$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-characters-by-frequency.txt
```
The snag is: in 3 minutes I didn't find a clean Cantonese corpus on opus.nlpl.eu. Once you have one, identify the characters that spike meaningfully in Cantonese but not in Mandarin. Then code a switch for cn vs yue, and add it to the current project or a fork.
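The cn vs yue switch could look roughly like the sketch below. Everything in it is an assumption for illustration: the marker set is a hand-picked handful of characters commonly used in written Cantonese (not derived from a corpus), the `threshold` is arbitrary, and `refine_chinese` is a hypothetical post-processing helper, not an existing function of language-detection. It also assumes the detector reports Chinese with a code starting with `zh`.

```python
# Hypothetical marker set: hand-picked characters common in written
# Cantonese. A real list should come from corpus frequency comparison.
YUE_MARKERS = frozenset("嘅咗喺冇唔佢哋嘢啲咁嚟")


def refine_chinese(text, detected, threshold=0.02):
    """Relabel a Chinese detection as 'yue' when marker characters are frequent.

    `detected` is the language code from the upstream detector; codes that
    do not start with 'zh' are passed through unchanged.
    `threshold` is an assumed cut-off on the share of marker characters.
    """
    if not detected.startswith("zh"):
        return detected
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return detected
    share = sum(c in YUE_MARKERS for c in chars) / len(chars)
    return "yue" if share >= threshold else detected
```

For example, `refine_chinese("佢哋喺度食緊飯", "zh-tw")` would give `"yue"`, while a text with no marker characters keeps the detector's original label. Running the check only after a Chinese result keeps the switch from interfering with the other languages.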

Cantonese language support #65