optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

misdetection because of break at CONV_THRESHOLD #91

Open danielnaber opened 6 years ago

danielnaber commented 6 years ago

#40 introduced this line in detectBlockShortText:

if (Util.normalizeProb(prob) > CONV_THRESHOLD) break;

However, I found a case in which a clearly German text of 100 characters is identified as Dutch. This does not happen when I comment out the break (while keeping the Util.normalizeProb(prob) call).

Code to reproduce: https://gist.github.com/danielnaber/6f738fca065e87a5d067710aabaa1883
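
To illustrate how a break at a convergence threshold can lock in the wrong language, here is a minimal, self-contained sketch. It is not the library's code: the class EarlyBreakSketch, the likelihood values in SAMPLE, and the threshold value 0.99 are all made up for illustration, and normalizeProb here is a local stand-in, not Util.normalizeProb. It assumes a simple per-n-gram Bayesian update in which the first few n-grams happen to favour Dutch and the later ones favour German.

```java
import java.util.Arrays;

class EarlyBreakSketch {
    // Illustrative threshold only; not taken from the library.
    static final double CONV_THRESHOLD = 0.99;

    static final String[] LANGS = {"German", "Dutch"};

    // Made-up likelihoods: SAMPLE[n][lang] = P(n-gram n | lang).
    // The first three n-grams favour Dutch, the remaining five favour German.
    static final double[][] SAMPLE = {
        {0.01, 0.1}, {0.01, 0.1}, {0.01, 0.1},
        {0.1, 0.01}, {0.1, 0.01}, {0.1, 0.01}, {0.1, 0.01}, {0.1, 0.01}
    };

    // Normalizes prob in place so it sums to 1 and returns the largest entry.
    static double normalizeProb(double[] prob) {
        double sum = 0;
        for (double p : prob) sum += p;
        double max = 0;
        for (int i = 0; i < prob.length; i++) {
            prob[i] /= sum;
            if (prob[i] > max) max = prob[i];
        }
        return max;
    }

    // Returns the index of the most probable language after processing the
    // n-grams in order, optionally stopping once one language passes the threshold.
    static int detect(double[][] likelihoods, boolean earlyBreak) {
        int langCount = likelihoods[0].length;
        double[] prob = new double[langCount];
        Arrays.fill(prob, 1.0 / langCount); // uniform prior
        for (double[] ngram : likelihoods) {
            for (int lang = 0; lang < langCount; lang++) {
                prob[lang] *= ngram[lang];
            }
            // Normalization always runs; only the break is optional.
            double max = normalizeProb(prob);
            if (earlyBreak && max > CONV_THRESHOLD) break;
        }
        int best = 0;
        for (int lang = 1; lang < langCount; lang++) {
            if (prob[lang] > prob[best]) best = lang;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("with break:    " + LANGS[detect(SAMPLE, true)]);
        System.out.println("without break: " + LANGS[detect(SAMPLE, false)]);
    }
}
```

With this toy data the loop stops after the second n-gram when the break is enabled, so the German evidence that follows is never seen and Dutch wins. With the break disabled, all eight n-grams are processed and German wins. Whether the real misdetection follows the same pattern would need to be checked against the gist above.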

Hronom commented 5 years ago

Just for the record: in my project, JUnit tests were failing because of zero probabilities (version 0.5). When the same code was run as a regular application, all probabilities were fine. This behaviour occurred only under Windows (in my case Win 10 x64, Oracle Java 8). Under Linux (Oracle Java 8) it worked in both cases (regular run and JUnit).

The problem was solved by upgrading to version 0.6, and it seems to be related to this break.