pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai

Multi class accuracy metric #1383

Closed: erezalg closed this issue 4 years ago

erezalg commented 4 years ago

I looked through the issues and googled the question, but couldn't find anything I could use, so here goes.

I'm porting some code from vanilla PyTorch to Ignite, and I have a CIFAR10 classifier. At the end of every evaluation epoch, I want to report the per-class accuracy. In Ignite I only found the total accuracy (which I use), but not a per-class one. I wrote this custom metric, but I have a few problems with it:

import torch
import torch.nn.functional as F

from ignite.metrics import Metric
# these decorators help with distributed settings
from ignite.metrics.metric import sync_all_reduce, reinit__is_reduced

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


class CustomAccuracy(Metric):

    def __init__(self, *args, **kwargs):
        self._num_correct = [0] * len(classes)
        self._num_examples = [0] * len(classes)
        super().__init__(*args, **kwargs)
        self.i = 0

    @reinit__is_reduced
    def reset(self):
        self._num_correct = [0] * len(classes)
        self._num_examples = [0] * len(classes)
        self.i = 0
        super(CustomAccuracy, self).reset()

    @reinit__is_reduced
    def update(self, output):
        y_pred, y = output

        for predtensor,real in zip(y_pred,y):
            pred = torch.argmax(F.softmax(predtensor,0), 0)
            self._num_examples[real] += 1
            if real == pred:
                self._num_correct[real] += 1

    @sync_all_reduce("_num_examples", "_num_correct")
    def compute(self):
        class_accuracy = [0] * len(classes)
        for i, (correct, total) in enumerate(zip(self._num_correct,self._num_examples)):
            class_accuracy[i] = correct / total
        #if self._num_examples == 0:
        #    raise NotComputableError('CustomAccuracy must have at least one example before it can be computed.')
        return class_accuracy

First, it depends on the loss I chose. I use nn.CrossEntropyLoss(), which applies softmax internally, so in my metric I also had to add it. But I assume that if I use another loss without softmax I won't need it (and my network will handle that). Second, this feels like something that should be available out of the box, so I'm wondering if I'm missing something.

Any advice?

vfdev-5 commented 4 years ago

@erezalg no problem with asking questions here :)

First, it depends on the loss I chose. I use nn.CrossEntropyLoss(), which applies softmax internally, so in my metric I also had to add it. But I assume that if I use another loss without softmax I won't need it (and my network will handle that).

All metrics, including custom ones, support output_transform, which can adapt the output of your network to the input of the metric: logits, probabilities, thresholded predictions for the binary case, etc.

So, you can set

def update(self, output):
    ...
    for predtensor,real in zip(y_pred,y):    
         pred = torch.argmax(predtensor, 0)

and set up the metric like

acc_per_class = CustomAccuracy(output_transform=lambda output: (F.softmax(output[0], dim=1), output[1]))
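
For completeness, a minimal sketch of how this could then be attached and read back; the evaluator and test_loader names are assumptions, not from the original snippet:

acc_per_class.attach(evaluator, "accuracy_per_class")

state = evaluator.run(test_loader)          # assumes an existing evaluator engine and test loader
print(state.metrics["accuracy_per_class"])  # list with one accuracy value per class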

Second, this feels like something that should be available out of the box, so I'm wondering if I'm missing something.

We have out-of-the-box per-class results for ignite.metrics.Precision and ignite.metrics.Recall. For ignite.metrics.Accuracy we were inspired by sklearn, where there is no such option...

Maybe another out-of-the-box solution could be to define 10 instances of ignite.metrics.Accuracy, mapping targets and predictions to a binary case with output_transform.
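
For reference, the per-class precision/recall mentioned above works out of the box with average=False; a minimal sketch (the evaluator name is an assumption):

from ignite.metrics import Precision, Recall

# with average=False, compute() returns a tensor with one value per class
precision = Precision(average=False)
recall = Recall(average=False)

precision.attach(evaluator, "precision_per_class")
recall.attach(evaluator, "recall_per_class")
# after evaluator.run(...), evaluator.state.metrics["recall_per_class"] is a tensor of shape (num_classes,)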

erezalg commented 4 years ago

Thanks @vfdev-5 ! Your suggestion with the output transform did the trick!

Regarding the suggestion to create 10 instances of ignite.metrics.Accuracy: this sounds interesting, as I can just use the existing infrastructure and not add anything else, if I understand you correctly. I'm not sure how to do this though: how do I tell the metric to look only at a specific class's accuracy? Also, how do we define accuracy? Let's say my dataset has 500 images with 0 cats and the model didn't predict a cat for any image; do I have a cat accuracy of 100%?

And while we're at it, I want to ask another question :) I want another custom metric that makes use of the input image. Let's say at the end of every epoch I want to report 10 images with their predictions, so I basically need x and y_pred. I followed the example in the documentation, but adding output_transform=lambda x, y, y_pred: {"x": x, "y": y, "y_pred": y_pred} to all of the evaluator's metrics doesn't work because, as far as I understand, it is also applied to my loss function, which expects only 2 tensors in the output. So instead I do something like:

my_metric = MyMetric(output_transform=lambda x, y, y_pred: {"x": x, "y": y, "y_pred": y_pred})
my_metric.attach(evaluator, 'mymetric')

which, from what I gather, should only transform the output for this metric.

But it doesn't work, and fails with this error: TypeError: <lambda>() missing 2 required positional arguments: 'y' and 'y_pred'

Not sure if it's my Python skills or my Ignite skills, but I can't figure out how to solve this :) Help is appreciated!

Thanks

vfdev-5 commented 4 years ago

Your suggestion with the output transform did the trick!

@erezalg glad that it worked :)

Regarding the suggestion to create 10 instances of ignite.metrics.Accuracy: this sounds interesting, as I can just use the existing infrastructure and not add anything else, if I understand you correctly. I'm not sure how to do this though: how do I tell the metric to look only at a specific class's accuracy? Also, how do we define accuracy? Let's say my dataset has 500 images with 0 cats and the model didn't predict a cat for any image; do I have a cat accuracy of 100%?

Accuracy is defined as (TP + TN) / (TP + TN + FP + FN). Per-class accuracy will be something like binary accuracy for a single class. Yes, in your example with 0 cats in 500 images and 0 cat predictions, I'd say the accuracy for predicting cat is 100%. Please keep in mind that the mean of these binary accuracies is not the overall accuracy.

Code snippet for 5 classes (easy to check)

from functools import partial
import torch

from ignite.utils import to_onehot
from ignite.engine import Engine
from ignite.metrics import Accuracy

torch.manual_seed(0)
num_classes = 5
batch_size = 4
acc_per_class = {}

def ot_per_class(output, index):
    y_pred, y = output
    # probably, we have to apply torch.sigmoid if output is logits
    y_pred_bin = (y_pred > 0.5).to(torch.long)
    y_ohe = to_onehot(y, num_classes=num_classes)
    return (y_pred_bin[:, index], y_ohe[:, index])

for i in range(num_classes):
    acc_per_class["acc_{}".format(i)] = Accuracy(output_transform=partial(ot_per_class, index=i))

def processing_fn(e, b):
    y_true = torch.randint(0, num_classes, size=(batch_size, ))
    y_preds = torch.rand(batch_size, num_classes)
    print("y_true:", y_true)
    print("y_preds:", (y_preds > 0.5).to(torch.long))
    return y_preds, y_true

engine = Engine(processing_fn)

for n, acc in acc_per_class.items():
    acc.attach(engine, name=n)

engine.run([0, ])
engine.state.metrics

> y_true: tensor([1, 4, 1, 4])
y_preds: tensor([[0, 1, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1],
        [1, 1, 0, 1, 0]])

{'acc_0': 0.75, 'acc_1': 0.5, 'acc_2': 0.5, 'acc_3': 0.5, 'acc_4': 0.25}

Class 0 was wrongly predicted only once for the last sample and thus accuracy for class 0 is 3.0 / 4.0.
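
A quick manual check of that number from the printed tensors (just plain PyTorch, to make the arithmetic explicit):

import torch

y_true = torch.tensor([1, 4, 1, 4])
pred_col0 = torch.tensor([0, 0, 0, 1])    # first column of the printed binary predictions
true_col0 = (y_true == 0).to(torch.long)  # tensor([0, 0, 0, 0])

print((pred_col0 == true_col0).float().mean())  # tensor(0.7500)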

And while we're at it, I want to ask another question :) I want another custom metric that makes use of the input image. Let's say at the end of every epoch I want to report 10 images with their predictions, so I basically need x and y_pred. I followed the example in the documentation, but adding output_transform=lambda x, y, y_pred: {"x": x, "y": y, "y_pred": y_pred} to all of the evaluator's metrics doesn't work because, as far as I understand, it is also applied to my loss function, which expects only 2 tensors in the output. So instead I do something like: my_metric = MyMetric(output_transform=lambda x, y, y_pred: {"x": x, "y": y, "y_pred": y_pred}) my_metric.attach(evaluator, 'mymetric')

Is it something like this that you would like to do: https://discuss.pytorch.org/t/how-access-inputs-in-custom-ignite-metric/91221/6 ? Please let me know if it helps; otherwise a minimal code snippet would be helpful to understand your issue :)

erezalg commented 4 years ago

hi @vfdev-5

Your example is spot on :) I feel like it's a little less intuitive to read (at least for me) than calculating it directly. Anyway, I think we should have these examples somewhere; it'd be really nice if people could search for "multi class accuracy" and find concrete examples. Where do you think is the best place to put these? I'm not saying that your calculation of class accuracy is wrong (it obviously isn't :) ), I'm just saying that a "cats predicted correctly / total cats in dataset" metric has great value when analyzing your data! I thought maybe a blog post? Or some KB?

And for my second question, here you go:

Code
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.datasets as datasets
import torchvision.transforms as transforms

from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss
from ignite.utils import setup_logger

from tqdm import tqdm
from trains import Task
from torch.utils.tensorboard import SummaryWriter
import matplotlib.pyplot as plt

task = Task.init(project_name='Image Example', task_name='image classification CIFAR10')
configuration_dict = {'number_of_epochs': 20, 'batch_size': 64, 'dropout': 0.25, 'base_lr': 0.001}
configuration_dict = task.connect(configuration_dict)  # enabling configuration override by trains
print(configuration_dict)  # printing actual configuration (after override in remote mode)

transform = transforms.Compose([transforms.ToTensor()])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=configuration_dict.get('batch_size', 4),
                                          shuffle=True, num_workers=10)

testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=configuration_dict.get('batch_size', 4),
                                         shuffle=False, num_workers=10)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

from ignite.metrics import Metric
# These decorators helps with distributed settings
from ignite.metrics.metric import sync_all_reduce, reinit__is_reduced


class TBReport(Metric):
    required_output_keys = ("y_pred", "y", "x")

    def __init__(self, *args, **kwargs):
        self._num_correct = [0] * len(classes)
        self._num_examples = [0] * len(classes)
        super().__init__(*args, **kwargs)
        self.i = 0

    @reinit__is_reduced
    def reset(self):
        self._num_correct = [0] * len(classes)
        self._num_examples = [0] * len(classes)
        self.i = 0
        super(TBReport, self).reset()

    @reinit__is_reduced
    def update(self, output):
        y_pred, y, x = output
        print(x.size())

    @sync_all_reduce("_num_examples", "_num_correct")
    def compute(self):
        return 0


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 6 * 6, 120)
        self.fc2 = nn.Linear(120, 84)
        self.dorpout = nn.Dropout(p=configuration_dict.get('dropout', 0.25))
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(self.dorpout(x))
        return x


# Training
def run(train_batch_size, val_batch_size, epochs, lr, momentum, log_interval):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net = Net().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=0.9)

    trainer = create_supervised_trainer(net, optimizer, criterion, device=device)
    trainer.logger = setup_logger("trainer")

    val_metrics = {"accuracy": Accuracy(),
                   "cel": Loss(criterion),
                   "tbrpt": TBReport(output_transform=lambda x, y, y_pred: {"x": x, "y": y, "y_pred": y_pred})}
    evaluator = create_supervised_evaluator(net, metrics=val_metrics, device=device)
    evaluator.logger = setup_logger("evaluator")

    desc = "ITERATION - loss: {:.2f}"
    pbar = tqdm(initial=0, leave=False, total=len(trainloader), desc=desc.format(0))

    @trainer.on(Events.ITERATION_COMPLETED(every=log_interval))
    def log_training_loss(engine):
        pbar.desc = desc.format(engine.state.output)
        pbar.update(log_interval)

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_training_results(engine):
        pbar.refresh()
        evaluator.run(trainloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["cel"]
        tqdm.write(
            "Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_validation_results(engine):
        evaluator.run(testloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["cel"]
        tqdm.write(
            "Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )
        pbar.n = pbar.last_print_n = 0

    @trainer.on(Events.EPOCH_COMPLETED | Events.COMPLETED)
    def log_time(engine):
        tqdm.write(
            "{} took {} seconds".format(trainer.last_event_name.name,
                                        trainer.state.times[trainer.last_event_name.name])
        )

    trainer.run(trainloader, max_epochs=epochs)
    pbar.close()

    PATH = './cifar_net.pth'
    torch.save(net.state_dict(), PATH)

    print('Finished Training')
    print('Task ID number is: {}'.format(task.id))


run(configuration_dict.get('batch_size'), configuration_dict.get('batch_size'),
    configuration_dict.get('number_of_epochs'), configuration_dict.get('base_lr'), 0.9, 10)
```

Not the most amazingly organized code, I know. Anyway, I borrowed from the thread you pointed me to and tried modifying it to my needs, but it doesn't work and I'm not sure why.

Thanks!!

vfdev-5 commented 4 years ago

I feel like it's a little less intuitive to read (at least for me) than calculating it directly.

@erezalg do you think it would be better to have an out-of-the-box solution for that? Yes, it's true that we could add this code snippet and some details to our FAQ or somewhere else. We also recommend in the README to search the issues labeled as "question"...

I'm just saying that a "cats predicted correctly / total cats in dataset" metric has great value when analyzing your data!

Actually, I have the impression that what you'd like is a per-class Recall metric. And we do have per-class precision/recall.

And for my second question, here you go:

The problem is with the output of the evaluator. It should return a dictionary with the keys you'd like to use inside TBReport.

    val_metrics = {
        "accuracy": Accuracy(), 
        "cel": Loss(criterion, output_transform=lambda out_dict: (out_dict["y_pred"], out_dict["y"])),
        "tbrpt": TBReport()}

    evaluator = create_supervised_evaluator(
        net, metrics=val_metrics, device=device,
        output_transform=lambda x, y, y_pred: {"x": x, "y": y, "y_pred": y_pred}
    )

I wonder what you would like to do inside the TBReport metric? I hope it is not for logging to TensorBoard :) Otherwise, please take a look here: https://labs.quansight.org/blog/2020/09/pytorch-ignite/#Common-training-handlers after "It is possible to extend the use of the TensorBoard logger very simply by integrating user-defined functions. For example, here is how to display images and predictions during training:"
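
For reference, a rough sketch of that pattern with the evaluator above; the handler, tag names, and log_dir are illustrative, and engine.state.output here is only the last validation batch:

import torchvision
from ignite.engine import Events
from ignite.contrib.handlers.tensorboard_logger import TensorboardLogger

tb_logger = TensorboardLogger(log_dir="tb-logs")

def log_images(engine, logger, event_name):
    # engine.state.output is the {"x", "y", "y_pred"} dict produced by the evaluator's output_transform
    out = engine.state.output
    step = engine.state.epoch
    grid = torchvision.utils.make_grid(out["x"][:10], nrow=5, normalize=True)
    logger.writer.add_image("validation/images", grid, step)
    logger.writer.add_text("validation/predictions", str(out["y_pred"][:10].argmax(dim=1).tolist()), step)

tb_logger.attach(evaluator, log_handler=log_images, event_name=Events.EPOCH_COMPLETED)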

erezalg commented 4 years ago

Thanks @vfdev-5! As usual, you are spot on :) Everything works better than what I was trying to do myself :D

One last question that I couldn't figure out myself: when I have multiclass recall, the name of the class in the TB graph is the label index. I assume I need some output_transform to change that to the class name, but I couldn't figure out how to do it.

Thanks A LOT!!!

vfdev-5 commented 4 years ago

@erezalg thanks for the feedback !

One last question that I couldn't figure out myself: when I have multiclass recall, the name of the class in the TB graph is the label index.

Unfortunately, it is not possible out of the box to add labels. This is something I was also thinking about adding as a feature request (if you'd like to send one, it would be helpful). The limitation is due to the tensor nature of the metric's output: for example, the Recall metric outputs torch.tensor([0.1, 0.2, 0.3, ..., 0.8]) and this is used directly within the OutputHandler for TensorBoard: https://github.com/pytorch/ignite/blob/75e20420a2391ad2e11ee17df65f781a659ae6ec/ignite/contrib/handlers/tensorboard_logger.py#L290-L291

However, there is a workaround. The idea is to create N metrics that output scalars instead of a single metric that gives a tensor; this way we can label each metric as we'd like. We can use metric arithmetics (indexing) for that. Something like this should work:

from ignite.metrics import Recall

num_classes = 10
cls_name_mapping = ["car", ...]
val_metrics = {}

for i in range(num_classes):
    cls_name = cls_name_mapping[i]
    val_metrics["Recall/{}".format(cls_name)] = Recall(average=False)[i].item()
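
Each entry in val_metrics above behaves like a regular metric, so it can be attached (or passed to create_supervised_evaluator) as usual; a small sketch, assuming the evaluator from earlier:

for name, metric in val_metrics.items():
    metric.attach(evaluator, name)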

erezalg commented 4 years ago

@vfdev-5 That did the trick. It would've been nice if you could pass a dict object (mapping label index to class name) or just a list of strings, but this works too and it's not TOO ugly.

I will most definitely open a feature request. The way it looks, maybe it's better to pass the class list directly to the TensorboardLogger object at instantiation? It seems like changing the API of metrics is complicated, especially when it's only for TB visualization. If the OutputHandler class takes metric_names, it wouldn't be weird to also give it metric_classes.

vfdev-5 commented 4 years ago

The way it looks, maybe it's better to pass the class list directly to the TensorboardLogger object at instantiation?

It is not only TB related; it concerns all other supported experiment tracking systems too, and I am not sure where it would be best to add this meta info.

erezalg commented 4 years ago

I see, OK makes sense! I'll open the feature request and let's see where it leads!

Thanks again!

mfoglio commented 3 years ago

I would love to have the possibility of computing accuracy for each class out of the box too :) It would be nice to have the average parameter for Accuracy, as provided for Recall and Precision.
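
Until something like that is available, one possible workaround (a sketch, not an official ignite feature) is to derive per-class binary accuracy from ignite.metrics.ConfusionMatrix via MetricsLambda:

import torch
from ignite.metrics import ConfusionMatrix, MetricsLambda

num_classes = 10
cm = ConfusionMatrix(num_classes=num_classes)  # expects y_pred of shape (batch, num_classes) and integer targets

def per_class_accuracy(cm_tensor: torch.Tensor) -> torch.Tensor:
    # cm_tensor[i, j] counts samples of true class i predicted as class j
    total = cm_tensor.sum()
    tp = cm_tensor.diag()
    fp = cm_tensor.sum(dim=0) - tp
    fn = cm_tensor.sum(dim=1) - tp
    tn = total - tp - fp - fn
    return (tp + tn).float() / total.float()  # binary accuracy per class, shape (num_classes,)

acc_per_class = MetricsLambda(per_class_accuracy, cm)
# acc_per_class can then be attached to an evaluator like any other metric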