shuyo / language-detection

This is a language detection library implemented in plain Java. (aliases: language identification, language guessing)
https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

Cantonese language support #65

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I tried 40+ languages. Using lang-detector I am able to detect languages with 
almost 100% accuracy.

But if I try Cantonese (e.g. 
 十六岁的中学生托特发现,纳粹战犯曲德也一直平静地住在
当地小镇上,托特对曲德战争时所犯下的罪行深感兴趣,决定
去勒索曲德,为了托特不告发他,曲德要向托特透露他在过
去战时所犯过的罪行,两人奇特的关系不久便告失控,影片的
结局是令人震惊), the result is Korean, Chinese (Traditional), or Chinese 
(Simplified).

Is there any way to detect the Cantonese language?

Original issue reported on code.google.com by swami0...@gmail.com on 2 Apr 2014 at 9:43

hugolpz commented 5 years ago

Not by character set alone: Cantonese is written with Chinese characters too. You need statistical or grammatical analysis, and statistical would be better. The idea is to detect the frequency spike of certain characters in Cantonese [iso:yue] texts compared to standard Mandarin.
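
A rough sketch of that statistical idea in plain Java, in case it helps. It is not part of language-detection and does not use its API. The class name, the marker list, and the 0.05 threshold are all my own assumptions for illustration. The marker list is a handful of characters that are common in written Cantonese and rare in standard written Mandarin; a real detector would learn such frequencies from a corpus. Source file is assumed to be UTF-8.

```java
// Hypothetical sketch: share of Han characters that are Cantonese "marker" characters.
final class CantoneseMarkerSketch {

    // Assumed, hand-picked marker characters (not learned from data).
    private static final String MARKERS = "嘅咗喺哋唔冇嚟啲咁佢嘢乜";

    // Returns markers / Han characters, or 0.0 if the text has no Han characters.
    static double markerRatio(String text) {
        int han = 0;
        int markers = 0;
        for (int cp : text.codePoints().toArray()) {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                han++;
                if (MARKERS.indexOf(cp) >= 0) {
                    markers++;
                }
            }
        }
        return han == 0 ? 0.0 : (double) markers / han;
    }

    // The threshold is a tuning parameter; 0.05 below is only a guess.
    static boolean looksCantonese(String text, double threshold) {
        return markerRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String cantonese = "佢哋喺度食咗飯, 唔使等我";
        String mandarin = "他们在这里吃了饭, 不用等我";
        System.out.println(looksCantonese(cantonese, 0.05));
        System.out.println(looksCantonese(mandarin, 0.05));
    }
}
```

This only works on text written in colloquial Cantonese. A passage in formal written Chinese has few or no such markers and will score close to zero, so it cannot be separated from Mandarin this way.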