Open rococode opened 6 years ago
terrible
During genprofile, Chinese text may be getting split by bytes rather than by characters, so the generated profiles are wrong.
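To illustrate the byte-versus-character concern, here is a minimal sketch. It is not Optimaize's genprofile code; the class and method names (`NgramSplitDemo`, `codePointNgrams`, `byteNgrams`) are made up for this example. It only shows how n-gram windows over UTF-8 bytes differ from windows over code points for Han text.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of Optimaize or Lingua.
final class NgramSplitDemo {

    // Windows over Unicode code points: every Han character stays intact.
    static List<String> codePointNgrams(String text, int n) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n));
        }
        return out;
    }

    // Windows over the UTF-8 bytes: a Han character takes 3 bytes,
    // so a 2-byte window always cuts through a character.
    static List<byte[]> byteNgrams(String text, int n) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<byte[]> out = new ArrayList<>();
        for (int i = 0; i + n <= bytes.length; i++) {
            out.add(Arrays.copyOfRange(bytes, i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four Han characters ("Chinese detection"), escaped to avoid source-encoding issues.
        String text = "\u4e2d\u6587\u68c0\u6d4b";
        System.out.println(codePointNgrams(text, 2).size()); // 3 character bigrams
        System.out.println(byteNgrams(text, 2).size());      // 11 byte bigrams, none a whole character
    }
}
```

If a profile were built from the byte windows, none of its bigrams would correspond to a real character pair, which would match the symptom described here.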
I've seen poor CJK detection with Optimaize too. I'm running the text in the first post of this issue through Lingua language detector for #107.
isoCode639_1 = {IsoCode639_1@8287} "zh"
isoCode639_3 = {IsoCode639_3@8288} "zho"
alphabets = {Collections$SingletonSet@8289} size = 1
0 = {Alphabet@8294} "HAN"
uniqueCharacters = ""
name = "CHINESE"
Very easy to use library. You say you're using Google Translate API, so it may be possible for you to use Lingua (you're not constrained to Optimaize).
@rococode I actually expect you to have 99% accuracy with defaults for this text, using a standard detector.
This text is 293 chars long. #63 saw ~150 char strings (mostly Chinese, some English) get strange results if not forcing short-text language detection.
Chinese language detection is super inaccurate.
Past issues reported two problems: Chinese detection fails when English is mixed in, and short Chinese phrases are detected as Korean.
Technically, multilingual texts and short texts are not fully supported. However, this is a much more significant problem, as even long texts are entirely misidentified.
Consider the following text:
This is classified as Korean and Japanese. I use an algorithm that breaks down the text into chunks to supplement the full text.
Here is the final classification I got:
{de=2.4541738909762207E-4, ru=2.4541738909762207E-4, ko=22.324138907516115, ja=1.6743056353889012, en=2.4541738909762207E-4, it=2.4541738909762207E-4, fr=2.4541738909762207E-4, es=2.4541738909762207E-4}
Chinese is not even listed once, after the detector considers every subphrase (basically split on punctuation) in the text and the full text itself. So it's not just a specific case, but it seems that standard Chinese is almost never even selected as a possibility, much less the most likely option.
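For anyone trying to reproduce this, the chunking approach described above can be sketched roughly as follows. This is a guess at the shape of it, not the exact code used: `ChunkedDetection`, `splitOnPunctuation`, and `aggregate` are hypothetical names, the detector is passed in as a plain function returning language-to-score maps, and summing the per-chunk scores is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of any detector library.
final class ChunkedDetection {

    // Split on runs of Unicode punctuation. \p{P} is the Unicode category,
    // so it covers CJK marks (fullwidth comma, ideographic full stop) as well as ASCII.
    static List<String> splitOnPunctuation(String text) {
        List<String> chunks = new ArrayList<>();
        for (String part : text.split("\\p{P}+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(trimmed);
            }
        }
        return chunks;
    }

    // Run the detector on every chunk plus the full text and sum the scores per language.
    // The detector is an assumption: any function from text to (language code -> score).
    static Map<String, Double> aggregate(String text,
                                         Function<String, Map<String, Double>> detector) {
        List<String> inputs = new ArrayList<>(splitOnPunctuation(text));
        inputs.add(text); // the full text supplements the chunks
        Map<String, Double> totals = new HashMap<>();
        for (String input : inputs) {
            detector.apply(input)
                    .forEach((lang, score) -> totals.merge(lang, score, Double::sum));
        }
        return totals;
    }
}
```

With a scheme like this, a language that never appears in any per-chunk result never appears in the totals either, which is what the output above shows for Chinese.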
@stormisover points out some weird filtering, could that be the problem? https://github.com/optimaize/language-detector/issues/33#issuecomment-351635876
Other relevant issues:
#63 #33
Not too keen on using Unicode ranges to detect languages so unfortunately gonna have to stick with Google Translate API for now :(
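For reference, the Unicode-range approach mentioned here would look something like the sketch below, using the JDK's `Character.UnicodeScript`. The class name `ScriptHeuristic` and the priority rule (any Hangul means Korean, else any kana means Japanese, else Han means Chinese) are assumptions for illustration only. It is crude: kanji-only Japanese would be reported as "zh", which is a fair reason to be wary of it.

```java
// Hypothetical helper, not part of Optimaize or Lingua.
final class ScriptHeuristic {

    // Guess a CJK language purely from which scripts occur in the text.
    static String guess(String text) {
        int hangul = 0;
        int kana = 0;
        int han = 0;
        for (int cp : text.codePoints().toArray()) {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HANGUL) {
                hangul++;
            } else if (script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA) {
                kana++;
            } else if (script == Character.UnicodeScript.HAN) {
                han++;
            }
        }
        // Assumed priority order; Han alone cannot separate Chinese from Japanese.
        if (hangul > 0) return "ko";
        if (kana > 0) return "ja";
        if (han > 0) return "zh";
        return "unknown";
    }
}
```

A check like this could at most serve as a sanity filter on a statistical detector's output (for example, rejecting "ko" when the text contains no Hangul at all), not as a replacement for one.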