slds-lmu / lecture_advml

Creative Commons Attribution 4.0 International

Update imbalanced intro slides with new benchmark and relevant files #2

Closed Tobias-Brock closed 1 year ago

Tobias-Brock commented 1 year ago

Updates and adds plots for the imbalanced intro slides, adds the benchmark to those slides, and adds the benchmark R file.

Why does the PPV reported by the mlr3 measures differ from the PPV computed from the confusion matrix? The TPR is basically the same, but the PPV is not. Here is the output in R:


> table
    nr      resample_result        Task             Learner resampling_id iters  Accuracy       TPR       PPV  F1_Score
 1:  1 <ResampleResult[21]> 10000/10000 Logistic Regression            cv     3 0.9198002 0.9189008 0.9205784 0.9197187
 2:  2 <ResampleResult[21]> 10000/10000                 SVM            cv     3 0.9192002 0.9184008 0.9199003 0.9191290
 3:  3 <ResampleResult[21]> 10000/10000 Classification Tree            cv     3 0.9078001 0.9141000 0.9027864 0.9083562
 4:  4 <ResampleResult[21]>  1000/10000 Logistic Regression            cv     3 0.9650004 0.7449815 0.8525444 0.7946861
 5:  5 <ResampleResult[21]>  1000/10000                 SVM            cv     3 0.9637270 0.7069585 0.8705149 0.7796100
 6:  6 <ResampleResult[21]>  1000/10000 Classification Tree            cv     3 0.9600003 0.6939305 0.8447743 0.7587848
 7:  7 <ResampleResult[21]>   100/10000 Logistic Regression            cv     3 0.9924750 0.3888889 0.7546898 0.5030303
 8:  8 <ResampleResult[21]>   100/10000                 SVM            cv     3 0.9924751 0.3190731 0.8250000 0.4545238
 9:  9 <ResampleResult[21]>   100/10000 Classification Tree            cv     3 0.9916832 0.3297683 0.6635178 0.4381271
10: 10 <ResampleResult[21]>    50/10000 Logistic Regression            cv     3 0.9959206 0.3210784 0.7666667 0.4409613
11: 11 <ResampleResult[21]>    50/10000                 SVM            cv     3 0.9958210 0.1813725 0.9166667 0.2987469
12: 12 <ResampleResult[21]>    50/10000 Classification Tree            cv     3 0.9946271 0.3394608 0.4959300 0.3893720

Example: take nr 12:

    nr      resample_result     Task             Learner  Accuracy       TPR       PPV  F1_Score
12: 12 <ResampleResult[21]> 50/10000 Classification Tree 0.9946271 0.3394608 0.4959300 0.3893720
> table$resample_result[[12]]$prediction()$confusion
          truth
response    1   -1
       1   17   21
      -1   33 9979
> tpr = 17/50
> tpr
[1] 0.34
> ppv = 17/38
> ppv
[1] 0.4473684

The PPV from aggregate is 0.4959300. Why is there a difference? TPR and accuracy agree with the confusion matrix, and we have set the positive class to 1 in the code review as well.
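One possible explanation, sketched below, is that the aggregated score is a macro average (the metric is computed within each fold and then averaged), whereas the confusion matrix from `$prediction()` pools the predictions of all folds (a micro average). The per-fold counts in this sketch are an assumption: they are not shown anywhere in this thread and were constructed only so that they sum to the pooled confusion matrix for nr 12.

```r
# Pooled confusion-matrix counts for nr 12, copied from the output above.
pooled <- c(tp = 17, fp = 21, fn = 33)

# ASSUMED per-fold counts (not shown in the thread), constructed so that
# they sum to the pooled counts.
folds <- data.frame(
  tp = c(7, 5, 5),
  fp = c(11, 2, 8),
  fn = c(10, 11, 12)
)
stopifnot(all(colSums(folds) == pooled))

# Micro average: pool the counts over all folds, then compute the metric once.
ppv_micro <- sum(folds$tp) / sum(folds$tp + folds$fp)
tpr_micro <- sum(folds$tp) / sum(folds$tp + folds$fn)

# Macro average: compute the metric within each fold, then take the mean.
ppv_macro <- mean(folds$tp / (folds$tp + folds$fp))
tpr_macro <- mean(folds$tp / (folds$tp + folds$fn))

print(round(c(ppv_micro = ppv_micro, ppv_macro = ppv_macro), 7))  # 0.4473684 0.4959300
print(round(c(tpr_micro = tpr_micro, tpr_macro = tpr_macro), 7))  # 0.3400000 0.3394608
```

With these assumed counts, the number of true positives per fold is nearly constant (17, 16, 17), so the TPR barely moves between the two averaging schemes, while the number of predicted positives varies a lot (18, 7, 13), so the PPV shifts noticeably. If this is indeed the cause, setting the measure's `average` field to `"micro"` in mlr3 should bring the aggregated PPV in line with the pooled confusion matrix; this has not been checked against the benchmark code here.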

I have not changed any of the calculations, except for switching the CV to three folds to avoid NaNs in the plots. Is that okay?
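The thread does not say where the NaNs came from. A plausible source, assumed here, is a fold in which the learner predicts no positives at all, so that the PPV becomes 0/0. A minimal sketch with a hypothetical helper `ppv()` and made-up per-fold scores:

```r
# Hypothetical helper: PPV from true-positive and false-positive counts.
ppv <- function(tp, fp) tp / (tp + fp)

# A fold without any positive predictions gives 0/0, which is NaN in R.
ppv_empty_fold <- ppv(0, 0)
print(ppv_empty_fold)

# Made-up per-fold PPVs: a single NaN makes the plain mean NaN as well.
fold_ppv <- c(0.5, ppv_empty_fold, 0.75)
mean_with_nan <- mean(fold_ppv)
print(mean_with_nan)

# Dropping undefined folds gives a number, but silently ignores those folds.
mean_without_nan <- mean(fold_ppv, na.rm = TRUE)
print(mean_without_nan)
```

With only 50 positives in 10,050 observations, more folds mean fewer positives per fold, which makes such empty folds more likely. Fewer folds reduce that risk without touching the metric definitions.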