pemistahl / lingua-go

The most accurate natural language detection library for Go, suitable for short text and mixed-language text
Apache License 2.0
1.19k stars · 66 forks

Add absolute confidence metric #16

Closed: benstigsen closed this issue 1 year ago

benstigsen commented 2 years ago

I do see that the README has this example:

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    confidenceValues := detector.ComputeLanguageConfidenceValues("languages are awesome")

    for _, elem := range confidenceValues {
        fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
    }

    // Output:
    // English: 1.00
    // French: 0.79
    // German: 0.75
    // Spanish: 0.72
}

But if I call detector.ComputeLanguageConfidenceValues("yo bebo ein large quantity of tasty leche"), English still comes out at 1.0. How do I get something like an absolute certainty or probability that the text is English? A value of 1.0 doesn't seem very helpful in that case. It might just be my lack of math experience; I'm assuming this is possible with the values in the example above, but I don't see exactly how.
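
To illustrate what I mean: a purely relative metric always pins the top score to 1.0, no matter how weak the actual fit is. Here is a small hypothetical sketch of that kind of rescaling (my own illustration, not the library's actual code):

```go
package main

import "fmt"

// relativeConfidence rescales raw scores so the best language gets 1.0.
// This is a hypothetical illustration of a relative metric, not lingua's code.
func relativeConfidence(scores map[string]float64) map[string]float64 {
	max := 0.0
	for _, s := range scores {
		if s > max {
			max = s
		}
	}
	out := make(map[string]float64, len(scores))
	for lang, s := range scores {
		out[lang] = s / max
	}
	return out
}

func main() {
	// Whether the raw scores indicate a strong or a weak match,
	// the winning language always ends up at exactly 1.0.
	strong := relativeConfidence(map[string]float64{"English": 0.9, "French": 0.2})
	weak := relativeConfidence(map[string]float64{"English": 0.3, "French": 0.25})
	fmt.Println(strong["English"], weak["English"]) // 1 1
}
```

So the 1.0 tells me which language won, but nothing about how confident the detector actually is.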

pemistahl commented 2 years ago

Hi @BenStigsen, thank you for your request. Currently, only a relative confidence metric has been implemented because it was easier to implement than the absolute one you are requesting. My goal is to implement such a metric in one of the next releases, so please stay tuned. My spare time is pretty limited at the moment, so it may take a while. The library will be improved, that's a promise.

pemistahl commented 1 year ago

@BenStigsen Unfortunately, an absolute confidence metric providing real probabilities is not possible to implement because of how the statistical algorithms work. However, I've reworked the computation of the current confidence metric which applies a more reasonable normalization to the statistical values. Now, the values resemble real probabilities more closely.

In your example above, the values would be translated into the following:

// English: 0.99
// French: 0.32
// German: 0.15
// Spanish: 0.01

Also, there is now the following new method that determines the confidence for a single language:

LanguageDetector.ComputeLanguageConfidence(text string, language Language) float64

I will release v1.1.0 soon which includes these improvements. If you have any feedback, I will be happy to read it. Thank you.

benstigsen commented 1 year ago

Is there any reason why the approach I'm using below wouldn't make the values work like percentages? Treat the code as pseudocode.

// sum adds up the confidence values returned by the detector.
func sum(values []lingua.ConfidenceValue) float64 {
   res := 0.0
   for _, v := range values {
      res += v.Value()
   }
   return res
}

confidenceValues := detector.ComputeLanguageConfidenceValues("some text here")
// English: 0.99
// French: 0.32
// German: 0.15
// Spanish: 0.01

// We normalize the values so they sum to 1
total := sum(confidenceValues)
for _, elem := range confidenceValues {
  fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value()/total)
}
// English: 0.67
// French: 0.22
// German: 0.10
// Spanish: 0.01

pemistahl commented 1 year ago

What you are doing there is complete nonsense. You are trying to normalize values which have been normalized already. The values that are returned by the library are supposed to be treated as percentages.

The statistical model in my library is based on a large number of conditional probabilities which are summed in log space because multiplying the original numbers would result in numerical underflow. You could convert the sums back to linear space by exponentiating them, but that would produce extremely small numbers which are not suitable for a confidence metric. That is why I apply min-max normalization to the log values; it is the perfect fit here to make the values comparable and to provide a confidence metric users can work with.
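
To make the numbers concrete, here is a small self-contained sketch (an illustration of the idea, not lingua's actual implementation; the log scores are made-up values) showing why exponentiating log-probability sums underflows, and how min-max normalizing the log values stays well-behaved:

```go
package main

import (
	"fmt"
	"math"
)

// minMaxNormalize maps log-space scores into [0, 1]:
// the best score becomes 1, the worst becomes 0.
func minMaxNormalize(logScores []float64) []float64 {
	min, max := logScores[0], logScores[0]
	for _, s := range logScores {
		if s < min {
			min = s
		}
		if s > max {
			max = s
		}
	}
	out := make([]float64, len(logScores))
	for i, s := range logScores {
		out[i] = (s - min) / (max - min)
	}
	return out
}

func main() {
	// Hypothetical log-probability sums for three candidate languages,
	// e.g. from summing ln(p) over a few thousand n-grams.
	logScores := []float64{-2000, -2300, -2600}

	// Converting back to linear space underflows to 0 for every language,
	// since float64 cannot represent anything below about exp(-745):
	for _, s := range logScores {
		fmt.Println(math.Exp(s)) // 0 for all three: useless as a confidence
	}

	// Min-max normalization of the log values keeps them comparable:
	fmt.Println(minMaxNormalize(logScores)) // [1 0.5 0]
}
```

Under this scheme the best-matching language always maps to 1 and the worst to 0, which is why the values are comparable across languages but are not absolute probabilities.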