Closed · vgherard closed this 3 years ago
The confusion matrix is not normalized; it just contains counts. That is, TP = # of correct recommendations.
I think you are confusing it with rates like the true positive rate (TPR).
I agree that this is the standard definition of a confusion matrix. However, it is not what the output of getConfusionMatrix() looks like; here, for example, TP, FP, TN and FN are evidently rational numbers:
library(recommenderlab)
#> Loading required package: Matrix
#> Loading required package: arules
#>
#> Attaching package: 'arules'
#> The following objects are masked from 'package:base':
#>
#> abbreviate, write
#> Loading required package: proxy
#>
#> Attaching package: 'proxy'
#> The following object is masked from 'package:Matrix':
#>
#> as.matrix
#> The following objects are masked from 'package:stats':
#>
#> as.dist, dist
#> The following object is masked from 'package:base':
#>
#> as.matrix
#> Loading required package: registry
#> Registered S3 methods overwritten by 'registry':
#> method from
#> print.registry_field proxy
#> print.registry_entry proxy
data("Jester5k")
scheme <- evaluationScheme(Jester5k,
                           method = "split",
                           train = 0.9,
                           given = 15,
                           goodRating = 5)
results <- evaluate(scheme, "UBCF", type = "topNList", n = 3)
#> UBCF run fold/sample [model time/prediction time]
#> 1 [0.056sec/2.486sec]
getConfusionMatrix(results)
#> [[1]]
#> TP FP FN TN precision recall TPR FPR
#> 3 0.548 2.452 14.402 67.598 0.1826667 0.03961105 0.03961105 0.03502547
Created on 2021-02-24 by the reprex package (v1.0.0)
Thank you for the code. You are right, this is confusing! It has been a while since I wrote the code. The code calculates a confusion matrix for each test user and then averages over the users (byUser defaults to FALSE):
# per-user confusion matrix entries and derived measures, one row per test user
res <- cbind(TP, FP, FN, TN, precision, recall, TPR, FPR)
# byUser = FALSE: average the per-user rows over all test users
if(!byUser) res <- colMeans(res, na.rm = TRUE)
So the interpretation of what you got is that on average a test user had 0.548 TPs, 2.452 FPs, etc.
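To make the averaging concrete, here is a minimal sketch continuing the reprex above: it redoes the prediction step by hand and asks calcPredictionAccuracy() for the per-user rows (byUser = TRUE), which can then be averaged explicitly. The names rec, pred and acc_by_user are just illustrative.
rec  <- Recommender(getData(scheme, "train"), method = "UBCF")
pred <- predict(rec, getData(scheme, "known"), type = "topNList", n = 3)
# one row of TP, FP, FN, TN, precision, recall, TPR, FPR per test user
acc_by_user <- calcPredictionAccuracy(pred, getData(scheme, "unknown"),
                                      given = 15, goodRating = 5,
                                      byUser = TRUE)
# averaging the per-user rows reproduces the byUser = FALSE output
colMeans(acc_by_user, na.rm = TRUE)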
Maybe the code should report the sum of TP, FP, FN, TN over all test users instead? So for 100 test users and a top-3 list you would have an N of 300.
For your data you would get:
> getConfusionMatrix(results)
[[1]]
TP FP FN TN N precision recall TPR FPR
3 280 1220 6921 34079 42500 0.1866667 0.03888349 0.03888349 0.03456189
The numbers seem to add up. The total number of items (N) for the 500 test users is 500 * (ncol(Jester5k) - 15) = 42500. For the top-3 list you make 3 * 500 = 1500 positive predictions (= TP + FP).
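A quick sanity check of this arithmetic in the same R session (Jester5k has 100 jokes):
ncol(Jester5k)               # 100 items (jokes)
500 * (ncol(Jester5k) - 15)  # N = 42500 item slots for the 500 test users
3 * 500                      # 1500 positive predictions for the top-3 lists
280 + 1220                   # TP + FP from the table above, also 1500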
Note: precision, recall, etc. change, since they are now calculated over all predictions rather than averaged over the users.
I think this plus a description in the man page would be less confusing. What do you think?
Thank you, this clears things up.
Yes, I think an explicit mention in the documentation would be helpful. I was comparing with what you write in the package vignette, and it was not clear to me that the results would be averaged over users.
Thank you for the explanations, best regards.
Valerio
PS: I probably need a second coffee, but shouldn't recall in your last post be TP / (TP + TN) = 0.008149248?
I always confuse these, so I had to look this up again. Recall is defined as TP / (TP + FN). This is what I have in the code:
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
TPR <- recall
FPR <- FP / (FP + TN)
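Plugging the summed counts from the table above into these formulas shows where the numbers come from, and what the expression in the PS computes instead:
TP <- 280; FP <- 1220; FN <- 6921; TN <- 34079
TP / (TP + FN)   # recall = 0.03888349, as in the table
TP / (TP + TN)   # 0.008149248, the quantity from the PS (not recall)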
Thanks for your comments and help! I will update the package and release a fix on CRAN soon.
Glad to help :-) thanks for the detailed explanations!
I reread Asela Gunawardana and Guy Shani (2009), "A Survey of Accuracy Evaluation Metrics of Recommendation Tasks", Journal of Machine Learning Research 10, 2935-2962, and saw that averaging over test users is the more common approach (compared to summing TP, etc.). I will therefore leave the averaged confusion matrix entries and improve the documentation.
I am a bit confused about the normalization of the (False/True)-(Positive/Negative) rates output by getConfusionMatrix() for the top-N classification task. I see that the *-Positive frequencies are normalized to the total number of users in the test set. For instance, with 100 users and a fixed number N of recommendations per user, we have TP + FP = N.
What about the *-Negative frequencies? How are TN and FN computed? Sorry if this is obvious, but I cannot figure it out.
Thanks in advance,
Valerio