@Car-la-F, I guess they are not equal during the training process, but in the end they turn out the same?
@smilesun, I only have results from after the training process (i.e. the final results saved after the benchmark). Looking at the .csv files, the values for recall and accuracy coincide in all digits.
From my understanding:
recall for label k = #(correct predictions of class k) / #(instances with class-label k)
acc = #(correct predictions) / #(all instances)
From what I read in the torchmetrics documentation, they additionally average over all classes to get one value instead of one value per class. A definition like this makes sense to me, but it does not explain why the values are exactly the same.
Does this make sense?

$$\mathrm{acc} = \frac{\#(\text{correct predictions})}{\#(\text{all instances})} = \frac{\#(\text{correct predictions})}{10 \cdot \#(\text{class-label } 0)}$$

since $\#(\text{class-label } 0) = \#(\text{class-label } 1) = \#(\text{class-label } 2) = \dots$, and

$$\mathrm{recall} = \frac{1}{10}\sum_{k=0}^{9} \frac{\#(\text{correct predictions of class } k)}{\#(\text{class-label } k)} = \frac{1}{10}\sum_{k=0}^{9} \frac{\#(\text{correct predictions of class } k)}{\#(\text{class-label } 0)} = \frac{1}{10} \cdot \frac{\#(\text{correct predictions})}{\#(\text{class-label } 0)} = \mathrm{acc}\,?$$
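A quick numerical sketch of this identity (plain numpy, not DomainLab code; the "80% correct" predictions are made up just for illustration):

import numpy as np

rng = np.random.default_rng(0)
num_classes, per_class = 10, 100
# balanced toy labels: every class appears exactly per_class times
y_true = np.repeat(np.arange(num_classes), per_class)
# predictions: correct with probability 0.8, otherwise a random class
y_pred = np.where(rng.random(y_true.size) < 0.8,
                  y_true,
                  rng.integers(0, num_classes, y_true.size))

acc = np.mean(y_pred == y_true)
recall_per_class = np.array([np.mean(y_pred[y_true == k] == k)
                             for k in range(num_classes)])
macro_recall = recall_per_class.mean()
# equal (up to floating-point rounding) because all classes have the same size
print(acc, macro_recall)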
Oh I see, so the reason for the measures being equal is the fact that all classes in MNIST have the same size (i.e. #(class-label 0) = #(class-label 1) = #(class-label 2) = ...).
I guess that's also the reason why accuracy, precision and recall are linearly dependent on one another.
Thank you
@smilesun
I just received the results from the white blood cell dataset. Here we got the same behaviour for recall and accuracy as with the MNIST dataset.
I understood that recall and accuracy being equal on MNIST was explained by all classes having the same number of samples, but this is not the case for the white blood cell dataset.
I'm just wondering: if recall and accuracy are always the same, why would one want to use both metrics to measure the performance of an algorithm?
This is related to this issue from John https://github.com/marrlab/DomainLab/issues/112
The changes I made last time to address #112 are here: https://github.com/marrlab/DomainLab/commit/a3bec6f67a44a97f226891e97fcb3da629134a05
The commit history can be confusing; just look at the current version of DomainLab on master.
I made the changes according to this tutorial:
https://torchmetrics.readthedocs.io/en/v0.10.3/pages/classification.html#input-types
where it said
# Multi-class inputs with probabilities
mc_preds_probs = torch.tensor([[0.8, 0.2, 0], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]])
mc_target_probs = torch.tensor([0, 1, 2])
according to
| Type | preds shape | preds dtype | target shape | target dtype |
| -- | -- | -- | -- | -- |
| Multi-class with logits or probabilities | (N, C) | float | (N,) | int |
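As a minimal check of these input types (a sketch, assuming torchmetrics v0.10.x as in the linked docs), feeding (N, C) float probabilities and (N,) integer targets directly:

import torch
from torchmetrics.classification import Accuracy

# (N, C) float probabilities; the argmax per row is taken internally
mc_preds_probs = torch.tensor([[0.8, 0.2, 0.0],
                               [0.1, 0.2, 0.7],
                               [0.3, 0.6, 0.1]])
# (N,) integer class indices
mc_target_probs = torch.tensor([0, 1, 2])

acc = Accuracy(num_classes=3)  # default average="micro"
# row argmaxes are [0, 2, 1] vs target [0, 1, 2] -> accuracy 1/3
print(acc(mc_preds_probs, mc_target_probs))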
@Car-la-F @xinyuejohn @schoersch @rahulbshrestha what do you think?
- Could some of you print info during training and post the output here? Does acc=recall always hold during the training process as well, when choosing a random algorithm but using the same settings as in the benchmark? @Car-la-F @rahulbshrestha @schoersch @xinyuejohn
I just checked: acc=recall always holds during the training process as well.
Thanks John.
I just found this; shall we change the default averaging from micro to macro? @RaoUmer
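To illustrate what the choice of averaging changes, here is a small sketch (toy imbalanced data, torchmetrics v0.10.x assumed, not DomainLab code):

import torch
from torchmetrics.classification import Accuracy, Recall

# imbalanced toy data: 4 samples of class 0, 1 of class 1, 1 of class 2
target = torch.tensor([0, 0, 0, 0, 1, 2])
preds = torch.tensor([0, 0, 0, 1, 1, 0])  # 4 of 6 predictions are correct

acc_micro = Accuracy(num_classes=3, average="micro")(preds, target)  # 4/6
rec_micro = Recall(num_classes=3, average="micro")(preds, target)    # 4/6, same as acc
rec_macro = Recall(num_classes=3, average="macro")(preds, target)    # mean(3/4, 1/1, 0/1)

print(float(acc_micro), float(rec_micro), float(rec_macro))

With micro averaging, single-label multiclass recall is just the global fraction of correct predictions, i.e. accuracy; only macro (or weighted) averaging makes recall carry information that accuracy does not.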
In this PR, acc still equals recall: https://github.com/marrlab/DomainLab/compare/acc_equal_recall?expand=1
@Car-la-F @xinyuejohn @rahulbshrestha @schoersch, I constructed the following toy example without DomainLab, and acc still seems to be equal to recall.
Have we not used torchmetrics correctly, or does the example itself lead to acc=recall?
import torch
from torchmetrics.classification import (AUROC, Accuracy, ConfusionMatrix,
                                          F1Score, Precision, Recall,
                                          Specificity)


class MetricTest():
    """Bundle several torchmetrics classification metrics for comparison."""

    def __init__(self, num_classes, average):
        self.acc = Accuracy(num_classes=num_classes, average=average)
        self.precision = Precision(num_classes=num_classes, average=average)
        self.recall = Recall(num_classes=num_classes, average=average)
        self.f1_score = F1Score(num_classes=num_classes, average=average)
        self.auroc = AUROC(num_classes=num_classes, average=average)
        self.specificity = Specificity(num_classes=num_classes,
                                       average=average)
        self.confmat = ConfusionMatrix(num_classes=num_classes)

    def cal_metrics(self, prob, target_label):
        """
        Compute all metrics on one batch of predictions.

        :param prob: tensor of shape (N, C) with class probabilities
        :param target_label: tensor of shape (N,) with integer class labels
        :return: dict mapping metric name to its numpy value
        """
        self.acc.reset()
        self.precision.reset()
        self.recall.reset()
        self.f1_score.reset()
        self.auroc.reset()
        self.specificity.reset()
        self.confmat.reset()
        self.acc.update(prob, target_label)
        self.precision.update(prob, target_label)
        self.recall.update(prob, target_label)
        self.specificity.update(prob, target_label)
        self.f1_score.update(prob, target_label)
        self.auroc.update(prob, target_label)
        self.confmat.update(prob, target_label)
        acc_y = self.acc.compute()
        precision_y = self.precision.compute()
        recall_y = self.recall.compute()
        specificity_y = self.specificity.compute()
        f1_score_y = self.f1_score.compute()
        auroc_y = self.auroc.compute()
        confmat_y = self.confmat.compute()
        dict_metric = {"acc": acc_y,
                       "precision": precision_y,
                       "recall": recall_y,
                       "specificity": specificity_y,
                       "f1": f1_score_y,
                       "auroc": auroc_y,
                       "confmat": confmat_y}
        keys = list(dict_metric)
        keys.remove("confmat")
        # collapse each metric tensor to a single number
        # (with average=None this sums the per-class values)
        for key in keys:
            dict_metric[key] = dict_metric[key].cpu().numpy().sum()
        dict_metric["confmat"] = dict_metric["confmat"].cpu().numpy()
        return dict_metric


mc_preds_probs = torch.tensor([[0.8, 0.2, 0],
                               [0.1, 0.2, 0.7],
                               [0.9, 0.01, 0.09],
                               [0.3, 0.6, 0.1]])
mc_target_probs = torch.tensor([0,
                                0,
                                1,
                                2])
metric_test = MetricTest(3, None)
print(metric_test.cal_metrics(mc_preds_probs, mc_target_probs))
To ensure that we don't lose the raw prediction file, a sanity check is added in this issue: https://github.com/marrlab/DomainLab/issues/151
commit ae3ce344fdc9b776143582f764eea7eea57196d7 (HEAD -> master, origin/master, origin/HEAD)
raise error if acc wrong, use different aggregation for acc and recall
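Roughly, the idea could look like the sketch below (hypothetical; the function name and structure are illustrative, see the commit itself for the actual change): keep acc as the global micro value, report recall as the macro average, and cross-check acc against the confusion matrix, raising an error if the two disagree.

import torch
from torchmetrics.classification import Accuracy, ConfusionMatrix, Recall

def cal_metrics_checked(prob, target, num_classes):
    """Illustrative only: micro acc, macro recall, plus a confusion-matrix sanity check."""
    acc = Accuracy(num_classes=num_classes, average="micro")(prob, target)
    recall = Recall(num_classes=num_classes, average="macro")(prob, target)
    confmat = ConfusionMatrix(num_classes=num_classes)(prob, target)
    # accuracy recomputed from the confusion matrix diagonal
    acc_from_confmat = confmat.diag().sum() / confmat.sum()
    if not torch.isclose(acc, acc_from_confmat):
        raise RuntimeError("accuracy does not match the confusion matrix")
    return {"acc": float(acc), "recall": float(recall)}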
The issue is fixed in master; @Car-la-F @xinyuejohn @schoersch @rahulbshrestha, could you check?
Summary:
Could you re-run the benchmark? I am afraid the acc you got before was actually the recall.
@Car-la-F @xinyuejohn @schoersch @rahulbshrestha
@smilesun Thanks for the update! @schoersch and I started the MNIST benchmark with learning rate $10^{-3}$ again, using the same random seeds, so hopefully we get a one-to-one comparison with our previous results. We think the benchmark will be finished by this evening.
@smilesun, @schoersch, @xinyuejohn, @rahulbshrestha,
I took a look at the benchmark plots for MNIST today and realized that recall and accuracy take exactly the same values. I tried to check how these metrics are implemented in the code, but as they come from PyTorch/torchmetrics, I do not think there is a bug in the implementation itself. From my understanding the values for recall and accuracy are not necessarily equal, but maybe I missed something. Does anyone know how this behaviour can be explained?