pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

'compute_language_confidence_values' probabilities do not sum to 1 #119

Closed: soloist96 closed this issue 1 year ago

soloist96 commented 1 year ago

I ran the following sample:

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
confidence_values = detector.compute_language_confidence_values("Cereal Churros Sabor A Canela Kellogg´S 260 Gr")
for language, value in confidence_values:
    print(f"{language.name}: {value:.2f}")

The output is:

SPANISH: 1.00
ENGLISH: 0.96
GERMAN: 0.87
FRENCH: 0.86

The documentation says that the probabilities sum to 1, which makes sense to me. But here, the values clearly do not sum to 1. It looks as if each language is scored by a separate binary classification and the languages are then ranked by those scores. Is this a bug?
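The mismatch can be checked by summing the printed values. The snippet below is a minimal sketch that does not call lingua itself. It assumes the detector result can be unpacked into (language, value) pairs as in the sample, and the helper names `total_confidence` and `sums_to_one` are hypothetical. The tolerance allows for the two-decimal rounding of the printed values.

```python
import math

# Values copied from the output reported above (rounded to two decimals).
reported = [("SPANISH", 1.00), ("ENGLISH", 0.96), ("GERMAN", 0.87), ("FRENCH", 0.86)]


def total_confidence(pairs):
    """Sum the confidence values of (language, value) pairs."""
    return math.fsum(value for _, value in pairs)


def sums_to_one(pairs, tolerance=0.02):
    """True if the values sum to 1 within a tolerance that allows for rounding."""
    return math.isclose(total_confidence(pairs), 1.0, abs_tol=tolerance)


print(f"total = {total_confidence(reported):.2f}")  # total = 3.69
print(sums_to_one(reported))  # False
```

The same helpers should work on the pairs produced by the loop in the sample, for example `[(language.name, value) for language, value in confidence_values]`.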

Also, if I restrict the detector to fewer languages, does that make the results more accurate?

pemistahl commented 1 year ago

Obviously, you are not using the latest release, 1.3.*, of the library. I reworked the computation of confidence scores in that release. With Lingua 1.3.1, your code returns the following probabilities, which sum to 1.0:

SPANISH: 0.65
ENGLISH: 0.29
GERMAN: 0.03
FRENCH: 0.03

So just update your dependency and you should be fine.
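To find out which release is installed before upgrading, a sketch along these lines may help. It assumes the PyPI distribution is named `lingua-language-detector` and that the version string begins with a numeric `major.minor` prefix. The helpers `major_minor` and `has_reworked_confidence` are hypothetical names, not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: the PyPI distribution name of lingua-py.
DISTRIBUTION = "lingua-language-detector"


def major_minor(version_string):
    """Return (major, minor) from a version string such as '1.3.1'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def has_reworked_confidence(version_string):
    """True if the version is 1.3 or newer (hypothetical helper)."""
    return major_minor(version_string) >= (1, 3)


try:
    installed = version(DISTRIBUTION)
except PackageNotFoundError:
    # The distribution is not installed in this environment.
    installed = None

if installed is None:
    print(f"{DISTRIBUTION} is not installed")
elif has_reworked_confidence(installed):
    print(f"{installed}: confidence values should sum to 1.0")
else:
    print(f"{installed}: upgrade with 'pip install --upgrade {DISTRIBUTION}'")
```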