pytorch / torcheval

A library that contains a rich collection of performant PyTorch model metrics, a simple interface to create new metrics, a toolkit to facilitate metric computation in distributed training and tools for PyTorch model evaluations.
https://pytorch.org/torcheval

Docs return description of binary_confusion_matrix incorrect #183

Open ChrisBog-MV opened 11 months ago

ChrisBog-MV commented 11 months ago

Not sure if I'm being daft here, but I think this line is incorrect:

https://github.com/pytorch/torcheval/blob/e0444c6f7cf89e20b5bd17b16dea4e3793810ba9/torcheval/metrics/functional/classification/confusion_matrix.py#L23

It says that the returned tensor contains the values:

( (true positive, false negative) , (false positive, true negative) )

But if you look at the examples, it shows (e.g.):

>>> input = torch.tensor([0, 1, 0.7, 0.6])
>>> target = torch.tensor([0, 1, 1, 0])
>>> binary_confusion_matrix(input, target)
        tensor([[1, 1],
                [0, 2]])

For those inputs, I count:

tn = 1  # index 0
tp = 2  # indices 1, 2
fp = 1  # index 3
fn = 0

So I believe that would mean the actual tensor being returned is either [[fp, tn], [fn, tp]] or [[tn, fp], [fn, tp]]. From my own experiments, I'm pretty sure it's the latter.

I've been scratching my head all morning about why my results look wrong, and I think this is why.
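
In case it helps anyone reproduce this, here's a minimal sketch (assuming the returned layout really is sklearn's [[tn, fp], [fn, tp]]) that runs the docstring example and unpacks the counts:

import torch
from torcheval.metrics.functional import binary_confusion_matrix

# Docstring example; the default threshold of 0.5 turns the input
# into predictions [0, 1, 1, 1].
input = torch.tensor([0, 1, 0.7, 0.6])
target = torch.tensor([0, 1, 1, 0])

matrix = binary_confusion_matrix(input, target)
print(matrix)
# tensor([[1, 1],
#         [0, 2]])

# Unpack assuming the sklearn-style layout [[tn, fp], [fn, tp]];
# this matches the hand counts above: tn=1, fp=1, fn=0, tp=2.
tn, fp, fn, tp = matrix.flatten().tolist()
print(tn, fp, fn, tp)  # 1 1 0 2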

bobakfb commented 11 months ago

Hi @ChrisBog-MV, thanks for pointing this out!

@JKSenthil I went through and looked at this. The issue is that we are following sklearn's convention of using

TN | FP
FN | TP

which is not standard for binary confusion matrices.

I would suggest we flip the rows and columns here with a bit of slicing:

return _confusion_matrix_compute(matrix, normalize)[[1,0]][:, [1,0]]

Tests and docstring examples will need to be updated as well, but basically we will then return

TP | FN
FP | TN

which is standard.
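
To make the re-ordering concrete, here's a quick standalone check of that double index on a plain tensor with distinct counts (using torch directly rather than the private _confusion_matrix_compute helper):

import torch

# Current sklearn-style layout, with distinct values so the
# re-ordering is easy to follow: TN=3, FP=2, FN=1, TP=4.
matrix = torch.tensor([[3, 2],
                       [1, 4]])

# Reverse the row order, then the column order.
flipped = matrix[[1, 0]][:, [1, 0]]
print(flipped)
# tensor([[4, 1],    TP | FN
#         [2, 3]])   FP | TN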

anordin95 commented 8 months ago

Hi!

+1. Also, I made an example that illustrates the problem by giving each category (TP, TN, FP, FN) a distinct expected count.

confusion_matrix_example.py

import numpy as np
import torch

import torcheval.metrics

binary_confusion_matrix_metric = torcheval.metrics.BinaryConfusionMatrix()
# Intentionally construct a dataset that should evaluate to these counts:
# TP: 4; TN: 3; FP: 2; FN: 1.
predictions_and_labels = np.array([
    # False negatives:
    (0, 1),
    # False positives:
    (1, 0),
    (1, 0),
    # True negatives:
    (0, 0),
    (0, 0),
    (0, 0),
    # True positives:
    (1, 1),
    (1, 1),
    (1, 1),
    (1, 1),
])

predictions = predictions_and_labels[:, 0]
labels = predictions_and_labels[:, 1]

binary_confusion_matrix_metric.update(torch.from_numpy(predictions), torch.from_numpy(labels))
binary_confusion_matrix = binary_confusion_matrix_metric.compute()
print(binary_confusion_matrix)

# The torcheval docs indicate the return value should be:
# ( (true positive, false negative) , (false positive, true negative) )
# So, we'd expect ( (4, 1), (2, 3) ).

# However, we don't observe what we expect!
# Instead I think the docs should say:
# ( (true negative, false positive) , (false negative, true positive) )

$ python confusion_matrix_example.py

tensor([[3., 2.],
        [1., 4.]])

PS: a torch.flip(binary_confusion_matrix, dims=[0, 1]) would also do the trick and perhaps be a bit more readable.
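
For example, a quick check on a hand-built matrix (same TN=3, FP=2, FN=1, TP=4 counts as the script above):

import torch

matrix = torch.tensor([[3., 2.],   # TN, FP
                       [1., 4.]])  # FN, TP

# Flipping both dimensions at once gives the [[TP, FN], [FP, TN]]
# ordering proposed above, same as the index-based slicing.
print(torch.flip(matrix, dims=[0, 1]))
# tensor([[4., 1.],
#         [2., 3.]])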