@Car-la-F, I guess they are not equal during the training process, but in the end they turn out the same?
@smilesun, I only have results from after the training process (i.e. the final results saved after the benchmark). Looking at the .csv files, the values for recall and accuracy coincide in all digits.
From my understanding:
recall for label k = #(correct predictions of class k) / #(instances with class-label k)
acc = #(correct predictions) / #(all instances)
From what I read in the torchmetrics documentation, they additionally average over all classes to get one value instead of one value per class. A definition like this makes sense to me, but it does not explain why the values are exactly the same.
Does this make sense?

$$\mathrm{acc} = \frac{\#(\text{correct predictions})}{\#(\text{all instances})} = \frac{\#(\text{correct predictions})}{10 \cdot \#(\text{class-label } 0)}$$

since $\#(\text{class-label } 0) = \#(\text{class-label } 1) = \#(\text{class-label } 2) = \dots$, and

$$\mathrm{recall} = \frac{1}{10}\sum_{k=0}^{9} \frac{\#(\text{correct predictions of class } k)}{\#(\text{class-label } k)} = \frac{1}{10}\sum_{k=0}^{9} \frac{\#(\text{correct predictions of class } k)}{\#(\text{class-label } 0)} = \frac{1}{10} \cdot \frac{\#(\text{correct predictions})}{\#(\text{class-label } 0)} = \mathrm{acc}\,?$$
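A quick numerical sketch of this identity (plain numpy, not DomainLab code; the "80% correct" predictions are made up just for illustration):

import numpy as np

rng = np.random.default_rng(0)
num_classes, per_class = 10, 100
# balanced toy labels: every class appears exactly per_class times
y_true = np.repeat(np.arange(num_classes), per_class)
# predictions: correct with probability 0.8, otherwise a random class
y_pred = np.where(rng.random(y_true.size) < 0.8,
                  y_true,
                  rng.integers(0, num_classes, y_true.size))

acc = np.mean(y_pred == y_true)
recall_per_class = np.array([np.mean(y_pred[y_true == k] == k)
                             for k in range(num_classes)])
macro_recall = recall_per_class.mean()
# equal (up to floating-point rounding) because all classes have the same size
print(acc, macro_recall)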
Oh I see, so the reason for the measures being equal is the fact that all classes in MNIST have the same size (i.e. #(class-label 0) = #(class-label 1) = #(class-label 2) = ...).
I guess that's also the reason why accuracy, precision and recall are linearly dependent on one another.
Thank you
@smilesun
I just received the results from the white blood cell dataset. Here we got the same behaviour for recall and accuracy as with the MNIST dataset.
I understood that recall and accuracy being equal on MNIST was explained by all classes having the same number of samples, but this is not the case for the white blood cell dataset.
I'm just wondering: if recall and accuracy are always the same, why would one want to use both metrics to measure the performance of an algorithm?
This is related to this issue from John https://github.com/marrlab/DomainLab/issues/112
The changes I made last time to address #112 are here: https://github.com/marrlab/DomainLab/commit/a3bec6f67a44a97f226891e97fcb3da629134a05
The commit history can be confusing; just look at the current version of DomainLab on master.
I made the changes according to this tutorial:
https://torchmetrics.readthedocs.io/en/v0.10.3/pages/classification.html#input-types
where it said
# Multi-class inputs with probabilities
mc_preds_probs = torch.tensor([[0.8, 0.2, 0], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]])
mc_target_probs = torch.tensor([0, 1, 2])
according to
| Type | preds shape | preds dtype | target shape | target dtype |
| -- | -- | -- | -- | -- |
| Multi-class with logits or probabilities | (N, C) | float | (N,) | int |
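As a minimal check of these input types (a sketch, assuming torchmetrics v0.10.x as in the linked docs), feeding (N, C) float probabilities and (N,) integer targets directly:

import torch
from torchmetrics.classification import Accuracy

# (N, C) float probabilities; the argmax per row is taken internally
mc_preds_probs = torch.tensor([[0.8, 0.2, 0.0],
                               [0.1, 0.2, 0.7],
                               [0.3, 0.6, 0.1]])
# (N,) integer class indices
mc_target_probs = torch.tensor([0, 1, 2])

acc = Accuracy(num_classes=3)  # default average="micro"
# row argmaxes are [0, 2, 1] vs target [0, 1, 2] -> accuracy 1/3
print(acc(mc_preds_probs, mc_target_probs))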
@Car-la-F @xinyuejohn @schoersch @rahulbshrestha what do you think?
- Could some of you print info during training and post the output here? Does acc=recall always hold during the training process as well, when choosing a random algorithm but using the same settings as in the benchmark? @Car-la-F @rahulbshrestha @schoersch @xinyuejohn
I just checked: acc=recall always holds during the training process as well.
Thanks John.
I just found this; shall we change the default averaging from micro to macro? @RaoUmer
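To illustrate what the choice of averaging changes, here is a small sketch (toy imbalanced data, torchmetrics v0.10.x assumed, not DomainLab code):

import torch
from torchmetrics.classification import Accuracy, Recall

# imbalanced toy data: 4 samples of class 0, 1 of class 1, 1 of class 2
target = torch.tensor([0, 0, 0, 0, 1, 2])
preds = torch.tensor([0, 0, 0, 1, 1, 0])  # 4 of 6 predictions are correct

acc_micro = Accuracy(num_classes=3, average="micro")(preds, target)  # 4/6
rec_micro = Recall(num_classes=3, average="micro")(preds, target)    # 4/6, same as acc
rec_macro = Recall(num_classes=3, average="macro")(preds, target)    # mean(3/4, 1/1, 0/1)

print(float(acc_micro), float(rec_micro), float(rec_macro))

With micro averaging, single-label multiclass recall is just the global fraction of correct predictions, i.e. accuracy; only macro (or weighted) averaging makes recall carry information that accuracy does not.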
In this PR, acc still equals recall: https://github.com/marrlab/DomainLab/compare/acc_equal_recall?expand=1
@Car-la-F @xinyuejohn @rahulbshrestha @schoersch, I constructed the following toy example without DomainLab, and acc still seems to be equal to recall.
Have we not used torchmetrics correctly, or does the example itself lead to acc=recall?
import torch
from torchmetrics.classification import (AUROC, Accuracy, ConfusionMatrix,
                                          F1Score, Precision, Recall,
                                          Specificity)


class MetricTest():
    """Bundle several torchmetrics classification metrics for comparison."""

    def __init__(self, num_classes, average):
        self.acc = Accuracy(num_classes=num_classes, average=average)
        self.precision = Precision(num_classes=num_classes, average=average)
        self.recall = Recall(num_classes=num_classes, average=average)
        self.f1_score = F1Score(num_classes=num_classes, average=average)
        self.auroc = AUROC(num_classes=num_classes, average=average)
        self.specificity = Specificity(num_classes=num_classes,
                                       average=average)
        self.confmat = ConfusionMatrix(num_classes=num_classes)

    def cal_metrics(self, prob, target_label):
        """
        Compute all metrics on one batch of predictions.

        :param prob: tensor of shape (N, C) with class probabilities
        :param target_label: tensor of shape (N,) with integer class labels
        :return: dict mapping metric name to its numpy value
        """
        self.acc.reset()
        self.precision.reset()
        self.recall.reset()
        self.f1_score.reset()
        self.auroc.reset()
        self.specificity.reset()
        self.confmat.reset()
        self.acc.update(prob, target_label)
        self.precision.update(prob, target_label)
        self.recall.update(prob, target_label)
        self.specificity.update(prob, target_label)
        self.f1_score.update(prob, target_label)
        self.auroc.update(prob, target_label)
        self.confmat.update(prob, target_label)
        acc_y = self.acc.compute()
        precision_y = self.precision.compute()
        recall_y = self.recall.compute()
        specificity_y = self.specificity.compute()
        f1_score_y = self.f1_score.compute()
        auroc_y = self.auroc.compute()
        confmat_y = self.confmat.compute()
        dict_metric = {"acc": acc_y,
                       "precision": precision_y,
                       "recall": recall_y,
                       "specificity": specificity_y,
                       "f1": f1_score_y,
                       "auroc": auroc_y,
                       "confmat": confmat_y}
        keys = list(dict_metric)
        keys.remove("confmat")
        # collapse each metric tensor to a single number
        # (with average=None this sums the per-class values)
        for key in keys:
            dict_metric[key] = dict_metric[key].cpu().numpy().sum()
        dict_metric["confmat"] = dict_metric["confmat"].cpu().numpy()
        return dict_metric


mc_preds_probs = torch.tensor([[0.8, 0.2, 0],
                               [0.1, 0.2, 0.7],
                               [0.9, 0.01, 0.09],
                               [0.3, 0.6, 0.1]])
mc_target_probs = torch.tensor([0,
                                0,
                                1,
                                2])
metric_test = MetricTest(3, None)
print(metric_test.cal_metrics(mc_preds_probs, mc_target_probs))
To ensure that we don't lose the raw prediction file, a sanity check is added in this issue: https://github.com/marrlab/DomainLab/issues/151
commit ae3ce344fdc9b776143582f764eea7eea57196d7 (HEAD -> master, origin/master, origin/HEAD)
raise error if acc wrong, use different aggregation for acc and recall
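Roughly, the idea could look like the sketch below (hypothetical; the function name and structure are illustrative, see the commit itself for the actual change): keep acc as the global micro value, report recall as the macro average, and cross-check acc against the confusion matrix, raising an error if the two disagree.

import torch
from torchmetrics.classification import Accuracy, ConfusionMatrix, Recall

def cal_metrics_checked(prob, target, num_classes):
    """Illustrative only: micro acc, macro recall, plus a confusion-matrix sanity check."""
    acc = Accuracy(num_classes=num_classes, average="micro")(prob, target)
    recall = Recall(num_classes=num_classes, average="macro")(prob, target)
    confmat = ConfusionMatrix(num_classes=num_classes)(prob, target)
    # accuracy recomputed from the confusion matrix diagonal
    acc_from_confmat = confmat.diag().sum() / confmat.sum()
    if not torch.isclose(acc, acc_from_confmat):
        raise RuntimeError("accuracy does not match the confusion matrix")
    return {"acc": float(acc), "recall": float(recall)}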
The issue is fixed in master; @Car-la-F @xinyuejohn @schoersch @rahulbshrestha, could you check?
Summary:
Could you re-run the benchmark? I am afraid the acc you got before was actually the recall.
@Car-la-F @xinyuejohn @schoersch @rahulbshrestha
@smilesun Thanks for the update! @schoersch and I started the MNIST benchmark with learning rate $10^{-3}$ again, using the same random seeds, so hopefully we get a one-to-one comparison with our previous results. We think the benchmark will be finished by this evening.
@smilesun, @schoersch, @xinyuejohn, @rahulbshrestha,
I took a look at the benchmark plots for MNIST today and realized that recall and accuracy take exactly the same values. I tried to check how these metrics are implemented in the code, but as they come from PyTorch/torchmetrics, I do not think there is a bug in the implementation itself. From my understanding the values for recall and accuracy are not necessarily equal, but maybe I missed something. Does anyone know how this behaviour can be explained?