optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Text with English and Japanese characters is identified as Galician or Basque #105

Closed: adblancod closed this issue 2 years ago

adblancod commented 4 years ago

The following text, BizteX cobitの使い方, is identified as Basque (eu) or Galician (gl):

Language [eu]
probability: [0.8410933444100416]
Language [gl]
Probability: [0.13353105016279743]

james-s-w-clark commented 4 years ago

@adblancod can you share the code snippet you used to get these results? I tried with https://github.com/optimaize/language-detector/issues/86#issuecomment-638818158 and got:

detectedLanguages = {ArrayList@1320}  size = 1
 0 = {DetectedLanguage@1325} "DetectedLanguage[eu:0.7259662120805258]"
detectedLanguagesNormalised = {ArrayList@1321}  size = 2
 0 = {DetectedLanguage@1328} "DetectedLanguage[eu:0.8410933444100412]"
 1 = {DetectedLanguage@1329} "DetectedLanguage[gl:0.13353105016279732]"

It looks like you're either manually normalising the input or using some Optimaize method that does the normalisation for you (which seems very important for CJK, but wasn't happening for me or for another user in #86).

I don't think I can help with your accuracy though; perhaps the string is just too short for Optimaize. Google Translate detects Slovenian.
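
To make the "too short" point concrete, here is a minimal sketch that counts the letters of the sample string by Unicode script. It uses only the JDK and is not part of Optimaize; the class name `ScriptCounter` and the method `countScripts` are hypothetical names chosen for illustration. It only shows how little text of each script a detector has to work with, and says nothing about how Optimaize scores n-grams internally.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not an Optimaize class.
public class ScriptCounter {

    /** Counts letters per Unicode script, ignoring spaces, digits and punctuation. */
    public static Map<UnicodeScript, Integer> countScripts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // "BizteX cobitの使い方", written with Unicode escapes so the source
        // compiles the same way regardless of the file encoding.
        String sample = "BizteX cobit\u306e\u4f7f\u3044\u65b9";
        System.out.println(countScripts(sample));
    }
}
```

For the string from this issue the counts are 11 Latin letters, 2 Hiragana and 2 Han characters. The Latin part is two product-like tokens rather than English words, so a guess among Latin-script languages is not very surprising.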