understandable-machine-intelligence-lab / Quantus

Quantus is an eXplainable AI toolkit for responsible evaluation of neural network explanations
https://quantus.readthedocs.io/

AOC metric computes AUC instead #113

Closed rodrigobdz closed 2 years ago

rodrigobdz commented 2 years ago

The metric IterativeRemovalOfFeatures should compute the area over the curve (AOC), but it is computing the area under the curve (AUC) instead; see the snippets below as proof.

Bug Description

The value being computed and appended to the list last_results is the AUC; see the get_auc_score definition. The commented-out line of code appears to contain the correct AOC computation:

# Correct AOC computation
self.last_results.append(1-get_auc_score(preds, np.arange(0, len(preds))))  

AUC Definition

Typo aside (it is being fixed in #112), the docstring should read area under the curve ~(AOC)~ (AUC).

https://github.com/understandable-machine-intelligence-lab/Quantus/blob/3a2f72cc99c3353bf60a36fcf2a3dc0eaa2fbfa3/quantus/metrics/faithfulness_metrics.py#L1434-L1436

1. AOC computing AUC instead

https://github.com/understandable-machine-intelligence-lab/Quantus/blob/3a2f72cc99c3353bf60a36fcf2a3dc0eaa2fbfa3/quantus/metrics/faithfulness_metrics.py#L712-L713

2. AUC being appended to all_results

https://github.com/understandable-machine-intelligence-lab/Quantus/blob/3a2f72cc99c3353bf60a36fcf2a3dc0eaa2fbfa3/quantus/metrics/faithfulness_metrics.py#L721

3. Final aggregated score contains AUC scores instead of AOC

https://github.com/understandable-machine-intelligence-lab/Quantus/blob/3a2f72cc99c3353bf60a36fcf2a3dc0eaa2fbfa3/quantus/metrics/faithfulness_metrics.py#L726-L728

annahedstroem commented 2 years ago

Thank you @rodrigobdz for highlighting this!

Please see this @Wickstrom, as it is related to https://github.com/understandable-machine-intelligence-lab/Quantus/pull/105, and leave your comments as you see fit!

Wickstrom commented 2 years ago

I can correct it. In the grand scheme of things, I think AUC or AOC will tell us the same thing, but for one higher is better, while for the other lower is better. However, since they use AOC in the paper, we should stick to that.

For the calculation itself I think:

self.last_results.append(1-get_auc_score(preds, np.arange(0, len(preds))))

might not be correct, since the AUC will not be bounded between 0 and 1; rather, it depends on the length of preds. I think:

self.last_results.append(len(preds)-get_auc_score(preds, np.arange(0, len(preds))))

works. For instance, consider this example:

import numpy as np
preds = [0.9, 0.7, 0.6, 0.5, 0.2]
print(1-np.trapz(preds)) # negative score (≈ -1.35)
print(len(preds)-np.trapz(preds)) # positive score (≈ 2.65)
annahedstroem commented 2 years ago

Amongst other smaller bug fixes, I'm including this fix in PR https://github.com/understandable-machine-intelligence-lab/Quantus/pull/114, which deals with several issues like these.

@Wickstrom can you confirm my understanding that the AOC calculation would be the following:

self.last_results.append(len(preds)-np.trapz(preds,dx=np.arange(0, len(preds))))

Thank you very much!

Wickstrom commented 2 years ago

Yes, I think that should be correct.

rodrigobdz commented 2 years ago

@annahedstroem Independently of the AOC calculation, the function numpy.trapz accepts only a float for its dx argument. Besides, the argument x=np.arange(0, len(preds)) can be omitted because it implies dx=1.0, which is already the default for that function.

Assuming the calculation above is indeed correct, the right function call should be:

self.last_results.append(len(preds)-np.trapz(preds))
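
As a quick illustrative check (not part of the Quantus code), the three calls below agree because x=np.arange(0, len(preds)) gives unit spacing between sample points, which is exactly the default dx=1.0:

import numpy as np

preds = [0.9, 0.7, 0.6, 0.5, 0.2]

# All three calls integrate over unit-spaced sample points and give the same result.
print(np.trapz(preds, x=np.arange(0, len(preds))))  # 2.35
print(np.trapz(preds, dx=1.0))                      # 2.35
print(np.trapz(preds))                              # 2.35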
rodrigobdz commented 2 years ago

@Wickstrom I'm not sure numpy.trapz is the right tool to calculate the AUC score bounded between 0 and 1. In the example provided, the resulting AUC score is outside the [0,1] bounds:

from typing import List
import numpy

preds: List[float] = [0.9, 0.7, 0.6, 0.5, 0.2]

# AUC
print(numpy.trapz(preds))
# Output: 2.35

# AOC
print(1-numpy.trapz(preds)) # negative score
# Output: -1.35

# AOC
print(len(preds)-numpy.trapz(preds)) # positive score
# Output: 2.65

There is a function in sklearn to compute the AUC but, unfortunately, it yields the same result as numpy.trapz, i.e. the result is not bounded:

import sklearn.metrics

x: numpy.ndarray = numpy.arange(0, len(preds))

print(sklearn.metrics.auc(x, preds))
# Output: 2.35

An alternative would be to implement AOPC, defined in equation 12 of [1], and take its complement to get AUC:

[Screenshot: equation 12 (the AOPC definition) from [1]]

[1] Samek, Wojciech, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. "Evaluating the visualization of what a deep neural network has learned." IEEE Transactions on Neural Networks and Learning Systems 28, no. 11 (2016): 2660-2673.
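
As a rough, per-sample sketch of what equation 12 computes (illustrative only; preds stands in for the model scores f(x^(k)) after k perturbation steps, and the paper additionally averages over the dataset):

import numpy as np

preds = [0.9, 0.7, 0.6, 0.5, 0.2]  # f(x^(k)) for k = 0..L perturbation steps

# AOPC: average drop of the prediction relative to the unperturbed input
aopc = np.mean([preds[0] - p for p in preds])
print(aopc)  # ≈ 0.32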

Wickstrom commented 2 years ago

@rodrigobdz AUC is not bounded between 0 and 1, so it is not a problem that we get a score that is larger than 1. My concern was that this formulation:

self.last_results.append(1-get_auc_score(preds, np.arange(0, len(preds))))  

assumes it is, so we just needed to modify it a bit. The sklearn.metrics.auc function also uses np.trapz to calculate the integral, which is why they produce the same result. Essentially, we want to compute an integral, as in the Samek et al. paper. I think using a fast and simple numpy function to compute this integral is the best way to go, but I'm also open to other suggestions.

rodrigobdz commented 2 years ago

@Wickstrom I fully agree with you on using a library to compute it for efficiency reasons. I had a misconception that the AUC score had to be in the range [0,1].

Final question on this topic: is the AOPC score bounded or not, similar to AUC?


I will leave this issue open as it will automatically be closed by https://github.com/understandable-machine-intelligence-lab/Quantus/pull/114.

Wickstrom commented 2 years ago

The AOPC is not bounded, just like the AUC. You can see this, for instance, in Figure 4 of the Samek et al. paper that you linked above.

annahedstroem commented 2 years ago

Thanks a lot, both!

The issue is now fixed: https://github.com/understandable-machine-intelligence-lab/Quantus/pull/114#pullrequestreview-943850575.

See line 708 in faithfulness.py.
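
For reference, a standalone sketch of the AOC computation this thread converged on (illustrative only; the exact line in the merged PR may differ):

import numpy as np

preds = [0.9, 0.7, 0.6, 0.5, 0.2]

# AOC as discussed above: subtract the trapezoidal AUC from len(preds)
aoc = len(preds) - np.trapz(preds)
print(aoc)  # ≈ 2.65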