Korean vs Chinese detection

malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection . Some after-the-fact modifications to get this working within sbt.

Apache License 2.0


When trying to detect the language with the following input:

"保險公司條例草案月底首讀 
政府建議成立獨立保險業監管局"

the detector returns Korean instead of Chinese, with > 0.99999 
confidence.

This also happens with most other Chinese (zh-tw) texts, although sometimes 
zh-tw gets listed with about 0.15 confidence.

By removing the Korean profile, zh-tw correctly becomes the detection result 
with > 0.9999 confidence.

This seems odd, since the Korean profile is completely different from the given 
input string.

Original issue reported on code.google.com by andreas....@oximity.com on 16 Apr 2014 at 1:53

I'm seeing the same issue with the string " 之前為帳單交易作業區已變更廣告內容之前為銷售代表之前為張貼日期為百分比之前為合約為目標對象條件已刪除結束日期之前為" from the cld2 test suite. This seems odd to me because the script is different.
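
To check the claim that the script differs, here is a minimal sketch that counts the Unicode script of each character in the input from the original report. It uses only the JDK's `Character.UnicodeScript` and does not call the language-detection library, so it does not reproduce the misdetection. The class name `ScriptCount` and the method `countScripts` are made up for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of language-detection.
public class ScriptCount {

    // Counts non-whitespace code points per Unicode script.
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Input from the original report, joined onto one line.
        String input = "保險公司條例草案月底首讀 政府建議成立獨立保險業監管局";
        System.out.println(countScripts(input));
    }
}
```

For this input every non-whitespace character should be counted under `HAN` and none under `HANGUL`, which is consistent with the observation that the text shares no script with modern Korean writing.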

Korean vs Chinese detection #66