Closed: benstigsen closed this issue 1 year ago
Hi @BenStigsen, thank you for your request. Currently, only a relative confidence metric has been implemented because it was easier to do than an absolute one which you are requesting. My goal is to implement such a metric for one of the next releases, so please stay tuned. My spare time is pretty limited at the moment, so it may take a while. The library will be improved, that's a promise.
@BenStigsen Unfortunately, an absolute confidence metric providing real probabilities is not possible to implement because of how the statistical algorithms work. However, I've reworked the computation of the current confidence metric which applies a more reasonable normalization to the statistical values. Now, the values resemble real probabilities more closely.
In your example above, the values would be translated into the following:
// English: 0.99
// French: 0.32
// German: 0.15
// Spanish: 0.01
Also, there is now the following new method that determines the confidence for a single language:
LanguageDetector.ComputeLanguageConfidence(text string, language Language) float64
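To sketch what such a single-language lookup does in relation to the full list returned by ComputeLanguageConfidenceValues, here is a minimal, self-contained illustration. The SortedLanguage type and the computeLanguageConfidence helper are stand-ins invented for this sketch, not the library's actual internals:

```go
package main

import "fmt"

// SortedLanguage is a stand-in for the library's (language, confidence) pair;
// the real type and its accessors live in lingua-go.
type SortedLanguage struct {
	language string
	value    float64
}

func (s SortedLanguage) Language() string { return s.language }
func (s SortedLanguage) Value() float64   { return s.value }

// computeLanguageConfidence sketches the shape of the new method: scan the
// full list of confidence values and return the entry for one language.
func computeLanguageConfidence(values []SortedLanguage, language string) float64 {
	for _, v := range values {
		if v.Language() == language {
			return v.Value()
		}
	}
	return 0 // the language was not part of the detector's candidate set
}

func main() {
	// The example values from the discussion above.
	values := []SortedLanguage{
		{"English", 0.99}, {"French", 0.32}, {"German", 0.15}, {"Spanish", 0.01},
	}
	fmt.Println(computeLanguageConfidence(values, "German")) // 0.15
}
```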
I will release v1.1.0 soon which includes these improvements. If you have any feedback, I will be happy to read it. Thank you.
Is there any reason why the way I'm currently doing it wouldn't make it work like percentages? Consider the code below pseudocode.
// sum returns the total of all values in arr.
func sum(arr []float64) float64 {
    res := 0.0
    for _, v := range arr {
        res += v
    }
    return res
}
confidenceValues := detector.ComputeLanguageConfidenceValues("some text here")
// English: 0.99
// French: 0.32
// German: 0.15
// Spanish: 0.01
// We normalize the values
values := make([]float64, 0, len(confidenceValues))
for _, elem := range confidenceValues {
    values = append(values, elem.Value())
}
total := sum(values)
for _, elem := range confidenceValues {
    fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value()/total)
}
// English: 0.67
// French: 0.21
// German: 0.10
// Spanish: 0.00
What you are doing there does not make sense: you are trying to normalize values which have already been normalized. The values returned by the library are already meant to be treated as percentages.
The statistical model in my library is based on a lot of conditional probabilities, which are summed up in log space because multiplying the original numbers would result in numerical underflow. You could convert the sums back to linear space by taking the exponential of each sum, but that would yield extremely small numbers which are not suitable for a confidence metric. That is why I apply min-max normalization to the log values: it is the perfect fit here to make the values comparable and to provide a confidence metric users can work with.
I do see that the README has this example:
But if I do

detector.ComputeLanguageConfidenceValues("yo bebo ein large quantity of tasty leche")

English is still going to result in 1.0. How do I get something like a certainty / probability that the text is English? Because 1.0 doesn't seem so helpful in that case. It might just be my lack of math experience; I'm assuming this is possible with the values in the example above, but I don't exactly see how.