optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Chinese detection is not good #85

Open rococode opened 6 years ago

rococode commented 6 years ago

Chinese language detection is super inaccurate.

Past issues raised two problems: Chinese detection fails when English text is mixed in, and short Chinese phrases are detected as Korean.

Granted, multilingual texts and short texts are not fully supported. But this issue is much more significant than that: even long, monolingual texts are entirely misidentified.

Consider the following text:

2006年,国际天文联合会对行星做出定义,规定行星即为按轨道围绕恒星运动、尺寸大到足以保持流體靜力平衡并且清除邻近的小天体的天体。流体静力平衡天体在尺寸上足以令其引力克服内部刚性,并因此成为圆形(椭球形)。“清除邻近小天体”的实际意义是指卫星大到其引力足以控制附近的所有物体。根据国际天文联会此一定义,太阳系共有8颗行星。所有以轨道围绕太阳运行并保持流体静力平衡,但未能清除附近小天体的天体称为矮行星。除太阳、行星和矮行星外,太阳系内的所有其它天体则称为太阳系小天体。此外,太阳和另外十余颗衛星尺寸也大到足以达成流体静力平衡。除太阳外,这些天体都属于“行星质量天体”,簡稱“行质天体”。

This is classified as Korean and Japanese. My code supplements full-text detection by also running the detector on chunks of the text.

Here is the final classification I got:

{de=2.4541738909762207E-4, ru=2.4541738909762207E-4, ko=22.324138907516115, ja=1.6743056353889012, en=2.4541738909762207E-4, it=2.4541738909762207E-4, fr=2.4541738909762207E-4, es=2.4541738909762207E-4}

Chinese is not listed even once, even though the detector considered every subphrase (split roughly on punctuation) as well as the full text itself. So this is not just one unlucky case: standard Chinese is almost never even offered as a possibility, much less selected as the most likely option.
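For reproducibility, here is a minimal sketch of the chunking step described above. The class name and the exact punctuation set are my own assumptions; the original code was not shared.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical reconstruction of the chunking step: split the input on
// common Western and CJK punctuation, then keep the non-empty pieces.
// Each chunk would be fed to the detector alongside the full text.
public class Chunker {
    public static List<String> chunks(String text) {
        return Arrays.stream(text.split("[,.;:!?,。;:!?、]+"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }
}
```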

@stormisover points out some weird filtering, could that be the problem? https://github.com/optimaize/language-detector/issues/33#issuecomment-351635876

Other relevant issues:

#63 #33

I'm not too keen on using Unicode ranges to detect languages, so unfortunately I'm going to have to stick with the Google Translate API for now :(
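For reference, the Unicode-range approach mentioned above can be sketched with the standard `Character.UnicodeScript` API, without any hand-maintained code point tables. The caveat the commenter is alluding to is real: this identifies scripts, not languages (Han characters appear in both Chinese and Japanese text), so it can only be a coarse pre-filter. The class name here is illustrative.

```java
// Fraction of code points in the text whose Unicode script is HAN.
// Distinguishes scripts, not languages: Japanese kanji are also HAN,
// so a high fraction means "CJK ideographs", not necessarily "Chinese".
public class ScriptCounter {
    public static double hanFraction(String text) {
        long total = text.codePoints().count();
        if (total == 0) return 0.0;
        long han = text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
        return (double) han / total;
    }
}
```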

lireagan commented 5 years ago

terrible

chj050322 commented 5 years ago

When running genprofile, the Chinese text may have been split by bytes rather than by characters, so the generated profiles are wrong.

james-s-w-clark commented 4 years ago

I've seen poor CJK detection with Optimaize too. Running the text from the first post of this issue through the Lingua language detector (see #107) gives:

isoCode639_1 = {IsoCode639_1@8287} "zh"
isoCode639_3 = {IsoCode639_3@8288} "zho"
alphabets = {Collections$SingletonSet@8289}  size = 1
 0 = {Alphabet@8294} "HAN"
uniqueCharacters = ""
name = "CHINESE"

It's a very easy library to use. Since you're already willing to call the Google Translate API, switching to Lingua may be an option (you're not constrained to Optimaize).

james-s-w-clark commented 4 years ago

@rococode I actually expect you to have 99% accuracy with defaults for this text, using a standard detector.

This text is 293 characters long. In #63, strings of ~150 characters (mostly Chinese, some English) gave strange results unless short-text language detection was forced.