mhahsler / recommenderlab

recommenderlab - Lab for Developing and Testing Recommender Algorithms - R package

Confusion about Confusion Matrix #46

Closed vgherard closed 3 years ago

vgherard commented 3 years ago

I am a bit confused about the normalization of (False/True)-(Positive/Negative) rates output by getConfusionMatrix() for the top N classification task.

I see that the *-Positive frequencies are normalized to the total number of users in the test set. For instance, with 100 users and a fixed number N of recommendations per user we have TP + FP = N.

What about the *-Negative frequencies? How are TN and FN computed? Sorry if this is obvious, but I cannot figure it out.

Thanks in advance,

Valerio

mhahsler commented 3 years ago

The confusion matrix is not normalized, it just contains counts. That is, TP = # correct recommendations

I think you are confusing it with rates like the true positive rate (TPR).
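
For concreteness, here is a toy sketch (plain R with made-up item sets, not recommenderlab internals) of how the four cells are counted for a single test user's top-N list:

    ## Toy illustration with made-up data: one test user, 85 withheld items,
    ## a top-3 list, and a known set of "good" (relevant) items.
    recommended <- c("item1", "item2", "item3")             # top-3 recommendations
    relevant    <- c("item2", "item8", "item17", "item42")  # withheld items rated as good
    all_items   <- paste0("item", 1:85)                     # items used for evaluation

    TP <- length(intersect(recommended, relevant))  # recommended and relevant
    FP <- length(setdiff(recommended, relevant))    # recommended but not relevant
    FN <- length(setdiff(relevant, recommended))    # relevant but not recommended
    TN <- length(all_items) - TP - FP - FN          # neither recommended nor relevant
    c(TP = TP, FP = FP, FN = FN, TN = TN)           # integer counts, not rates
    TP / (TP + FN)                                  # TPR/recall is a ratio derived from them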

vgherard commented 3 years ago

I agree that that is the standard definition of a confusion matrix. However, it is not what the output of getConfusionMatrix() looks like; e.g., here TP, FP, TN and FN are evidently rational numbers:

library(recommenderlab)
#> Loading required package: Matrix
#> Loading required package: arules
#> 
#> Attaching package: 'arules'
#> The following objects are masked from 'package:base':
#> 
#>     abbreviate, write
#> Loading required package: proxy
#> 
#> Attaching package: 'proxy'
#> The following object is masked from 'package:Matrix':
#> 
#>     as.matrix
#> The following objects are masked from 'package:stats':
#> 
#>     as.dist, dist
#> The following object is masked from 'package:base':
#> 
#>     as.matrix
#> Loading required package: registry
#> Registered S3 methods overwritten by 'registry':
#>   method               from 
#>   print.registry_field proxy
#>   print.registry_entry proxy

data("Jester5k")
scheme <- evaluationScheme(Jester5k, 
               method = "split", 
               train = 0.9, 
               given = 15,
               goodRating = 5)
results <- evaluate(scheme, "UBCF", type = "topNList", n = 3)
#> UBCF run fold/sample [model time/prediction time]
#>   1  [0.056sec/2.486sec]
getConfusionMatrix(results)
#> [[1]]
#>      TP    FP     FN     TN precision     recall        TPR        FPR
#> 3 0.548 2.452 14.402 67.598 0.1826667 0.03961105 0.03961105 0.03502547

Created on 2021-02-24 by the reprex package (v1.0.0)

mhahsler commented 3 years ago

Thank you for the code. You are right. This is confusing! It has been a while since I wrote the code. The code calculates a confusion matrix for each test user, and then it averages over the users (byUser defaults to FALSE).

    res <- cbind(TP, FP, FN, TN, precision, recall, TPR, FPR)
    if(!byUser) res <- colMeans(res, na.rm=TRUE)

So the interpretation of what you got is that on average a test user had 0.548 TPs, 2.452 FPs, etc.
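
As a minimal sketch (assuming the scheme from the reprex above, and that calcPredictionAccuracy() with byUser = TRUE returns the per-user rows before averaging), you can reproduce these averages yourself:

    ## Recompute per-user confusion matrix entries and average them; this should
    ## match getConfusionMatrix() up to identical defaults, since the split is fixed.
    rec  <- Recommender(getData(scheme, "train"), method = "UBCF")
    pred <- predict(rec, getData(scheme, "known"), type = "topNList", n = 3)
    cm   <- calcPredictionAccuracy(pred, getData(scheme, "unknown"),
                                   given = 15, goodRating = 5, byUser = TRUE)
    colMeans(cm[, c("TP", "FP", "FN", "TN")], na.rm = TRUE)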

Maybe the code should report the sum of TP, FP, FN, TN over all test users instead? So for 100 test users and a top-3 list you would have a total of 300 recommendations (TP + FP).

For your data you would get:

> getConfusionMatrix(results)
[[1]]
   TP   FP   FN    TN     N precision     recall        TPR        FPR
3 280 1220 6921 34079 42500 0.1866667 0.03888349 0.03888349 0.03456189

The numbers seem to add up. The total number of items (N) for the 500 test users is 500 * (ncol(Jester5k) - 15) = 42500. For the top-3 list you make 3 * 500 = 1500 positive predictions (= TP + FP).

Note: precision, recall, etc. change since they are now calculated over all predictions and not averaged over the users.
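
To spell out the arithmetic behind these pooled numbers (using the summed counts TP = 280, FP = 1220, FN = 6921, TN = 34079 from the table above):

    TP <- 280; FP <- 1220; FN <- 6921; TN <- 34079
    TP / (TP + FP)   # pooled precision = 280 / 1500   ~ 0.1867 (vs. 0.1827 averaged per user)
    TP / (TP + FN)   # pooled recall    = 280 / 7201   ~ 0.0389
    FP / (FP + TN)   # pooled FPR       = 1220 / 35299 ~ 0.0346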

I think this plus a description in the man page would be less confusing. What do you think?

vgherard commented 3 years ago

Thank you, this is clarifying.

Yes, I think an explicit mention in the documentation would be helpful. I was comparing with what you write in the package vignette, and it was not clear to me that the results are averaged over users.

Thank you for the explanations, best.

Valerio

PS: I probably need a second coffee, but shouldn't recall in your last post be: TP / (TP + TN) = 0.008149248 ?

mhahsler commented 3 years ago

I always confuse these, so I had to look it up again. Recall is defined as TP / (TP + FN). This is what I have in the code:

    precision <- TP / (TP + FP)
    recall <- TP / (TP + FN)
    TPR <- recall
    FPR <- FP / (FP + TN)
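
A quick check with the summed counts from the earlier post shows where the two numbers come from:

    280 / (280 + 6921)    # TP / (TP + FN) = recall ~ 0.0389
    280 / (280 + 34079)   # TP / (TP + TN) ~ 0.008149, the value computed in the PS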

Thanks for your comments and help! I will update the package and release a fix on CRAN soon.

vgherard commented 3 years ago

Glad to help :-) thanks for the detailed explanations!

mhahsler commented 3 years ago

I reread

Asela Gunawardana and Guy Shani (2009). A Survey of Accuracy Evaluation Metrics of Recommendation Tasks, Journal of Machine Learning Research 10, 2935-2962.

and saw that averaging over test users is the more common approach (compared to summing TP, etc.). I will therefore leave the averaged confusion matrix entries and improve the documentation.