hendrycks / natural-adv-examples

A Harder ImageNet Test Set (CVPR 2021)
MIT License

Understand AUPR95 in our paper #10

Closed giangnguyen2412 closed 2 years ago

giangnguyen2412 commented 2 years ago

Hello @hendrycks ,

In your code, it looks like you are using the model's confidence scores on the same ImageNet-O dataset twice. Can you explain why? How do you compute AUPR95 from two lists of confidence scores? I am trying to improve on the AUPR95 from your paper but cannot work out how the AUPR is obtained here. I did a quick check with two lists of 10,000 random floats, and the result is given below. What should I expect to see when I run my program to improve OOD performance?

import numpy as np
import calibration_tools  # from this repository

# Random confidences standing in for the in-distribution and OOD test sets
confidence_in = np.random.rand(10000,)
confidence_out = np.random.rand(10000,)

# Negate so that lower confidence becomes a higher anomaly score
in_score = -confidence_in
out_score = -confidence_out

aurocs, auprs, fprs = calibration_tools.get_measures(out_score, in_score)
calibration_tools.print_measures_old(aurocs, auprs, fprs, method_name='MSP')

Output:

FPR95:  94.88
AUROC:  50.90
AUPR:   50.77

Thanks a lot!

hendrycks commented 2 years ago

you are using the model's confidence scores on two the same imagenet-o datasets

We are not. Notice this line: https://github.com/hendrycks/natural-adv-examples/blob/07770705658c3a1c8acce31fd9dbd68f06e297c3/eval_many_models.py#L58

We are comparing ImageNet-O examples to ImageNet val examples.
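
For intuition, here is a minimal sklearn-based sketch (not the repository's exact implementation; the function name and sign convention below are illustrative) of how two score lists, one from ImageNet val and one from ImageNet-O, are turned into AUROC, AUPR, and FPR95:

import numpy as np
import sklearn.metrics as sk

def detection_metrics(in_score, out_score, recall_level=0.95):
    # Convention assumed here: higher score = more anomalous, so OOD is the positive class.
    scores = np.concatenate([in_score, out_score])
    labels = np.concatenate([np.zeros(len(in_score)), np.ones(len(out_score))])

    auroc = sk.roc_auc_score(labels, scores)
    aupr = sk.average_precision_score(labels, scores)

    # FPR at 95% TPR: the false positive rate at the first threshold whose TPR reaches 95%.
    fpr, tpr, _ = sk.roc_curve(labels, scores)
    fpr95 = fpr[np.searchsorted(tpr, recall_level)]
    return auroc, aupr, fpr95

With two sets of random scores, as in your snippet, all three metrics land near chance (AUROC around 50, FPR95 around 95), which is what your output shows.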

giangnguyen2412 commented 2 years ago

Hi @hendrycks ,

I got this when I ran my code.

FPR95:  100.00
AUROC:  50.97
AUPR:   57.36

What does this imply? A false positive rate of 100% at 95% recall sounds odd to me. Is this FPR95 value meaningful, or, when using ImageNet-O from your paper, should we only care about AUPR as reported in Figure 2?

hendrycks commented 2 years ago

We could care about FPR95 or AUROC, but for simplicity we just showed one of the metrics. AUPR and AUROC are more common than FPR95. The model might just have a very hard time detecting these images, hence the low performance.
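
To make the FPR95 = 100% reading concrete: it means that at the score threshold needed to flag 95% of the ImageNet-O images, essentially every ImageNet val image gets flagged too. A toy sketch with made-up Gaussian scores (purely illustrative, not real model outputs) shows how this happens when the OOD scores are not separated from, or even sit below, the in-distribution scores:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical anomaly scores: the in-distribution scores are, on average,
# HIGHER than the OOD scores, mimicking an adversarially filtered OOD set.
in_score = rng.normal(loc=1.0, scale=1.0, size=10000)
out_score = rng.normal(loc=0.0, scale=1.0, size=2000)

# The threshold that keeps 95% of OOD scores above it (TPR = 0.95) ...
threshold = np.quantile(out_score, 0.05)
# ... also flags nearly all in-distribution images as anomalous.
fpr95 = np.mean(in_score >= threshold)
print(f"FPR95: {100 * fpr95:.2f}")  # close to 100 for these toy scores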

stsavian commented 1 year ago

Greetings,

I am having a similar issue. I am testing the ImageNet-O metrics and I get the following values for a ResNet-50:

FPR95:  80.83
AUROC:  41.78
AUPR:   61.34

The AUPR value reported in the paper (Table 1 of the supplementary material) is 16.20%. I have tested the code multiple times with minimal changes (added stable_cumsum and adjusted the paths at L22-24).

Do you know why this is happening? Are you sure the code produces the correct results for a ResNet-50? My guess is that something could be going wrong when creating the symlinks to ImageNet (lines 54-60).

Thanks! Stefano