Open GoogleCodeExporter opened 9 years ago
Detection cannot work from the character set alone, because both languages are written with Chinese characters. It needs statistical or grammatical analysis; statistical would be better. The goal is to detect the characters whose frequency spikes in Cantonese [iso:yue] texts compared to standard Mandarin.
$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-characters-by-frequency.txt
The obstacle: in 3 minutes of searching I didn't find a clean Cantonese corpus on opus.nlpl.eu. Once a corpus is available, identify the characters that show a meaningful frequency spike in Cantonese but not in Mandarin. Then code a switch for cn vs yue, and add it to the current project or a fork.
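A rough sketch of the comparison and the switch, in Python. Everything here is an assumption made for illustration: the function names, the `ratio`, `min_freq` and `threshold` values, and the `"yue"` / `"cmn"` labels are hypothetical and not taken from the project. Real marker sets would have to come from proper corpora, and the thresholds would need tuning.

```python
from collections import Counter


def char_freq(text):
    """Relative frequency of each non-whitespace character in text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def spike_chars(yue_text, cmn_text, ratio=5.0, min_freq=0.01):
    """Characters at least `ratio` times more frequent in the Cantonese
    sample than in the Mandarin sample (both thresholds are guesses)."""
    yue = char_freq(yue_text)
    cmn = char_freq(cmn_text)
    floor = 1e-6  # avoids division by zero for characters absent from Mandarin
    return {
        ch
        for ch, f in yue.items()
        if f >= min_freq and f / max(cmn.get(ch, 0.0), floor) >= ratio
    }


def guess_variety(text, markers, threshold=0.02):
    """Return "yue" if marker characters make up at least `threshold`
    of the text, otherwise "cmn"."""
    freq = char_freq(text)
    share = sum(freq.get(ch, 0.0) for ch in markers)
    return "yue" if share >= threshold else "cmn"


if __name__ == "__main__":
    # Toy sentences only; a real run needs full corpora for both languages.
    markers = spike_chars("佢哋喺度食嘢", "他们在这里吃东西")
    print(sorted(markers))
    print(guess_variety("佢喺邊度", markers))
    print(guess_variety("他在哪里", markers))
```

With samples this small, every character missing from the Mandarin side counts as a spike, so the marker set is only meaningful once both corpora are large enough for the frequencies to stabilise.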
Original issue reported on code.google.com by
swami0...@gmail.com
on 2 Apr 2014 at 9:43