hendrycks / natural-adv-examples

A Harder ImageNet Test Set (CVPR 2021)
MIT License

Fixing a severe MISTAKE in calculating the accuracy using ImageNet-A #15

Open HengyueL opened 7 months ago

HengyueL commented 7 months ago

The eval.py example provided with the release of the ImageNet-A dataset contains a severe mistake in how ImageNet-A accuracy is computed for an ImageNet-1K pretrained model.

Cause: ImageNet-1K has 1,000 labelled classes, whereas ImageNet-A has only 200 labelled classes (a subset defined by the variable "thousands_k_to_200"). Since the model under test is trained on the ImageNet-1K dataset, the old version determines that model's prediction in the wrong way, causing the final ImageNet-A classification accuracy to be over-estimated.

Reason: I believe the goal of the ImageNet-A dataset is to test the robustness of the original ImageNet-1K model. That means the pretrained model's prediction should be determined by the argmax rule over all 1,000 logits; we should not assume the subset of 200 possible labels is known to the model and restrict the argmax to those 200 logits. Instead, we should consider all 1,000 logits and check whether the argmax lands on the correct class within the label subset.

Approach: Instead of using the pipeline provided below:

```python
output = net(data)[:, indices_in_1k]
pred = output.data.max(1)[1]
correct = pred.eq(target)
```

we should compute the result as follows (conceptual pseudocode; see the pull request for the actual implementation):

```python
output = net(data)
pred = MAP_TO_200_SUBCLASS[torch.argmax(output, dim=1)]  # argmax over all 1,000 logits, then map into the 200-class label space
correct = pred.eq(target)
```
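For concreteness, here is a minimal runnable sketch of both protocols, assuming a PyTorch DataLoader that yields (image, label) pairs with labels in the 0..199 ImageNet-A space and the 200-entry indices_in_1k list from eval.py; the names below (eval_restricted, eval_full_argmax, map_1k_to_200) are illustrative and not taken from the repository or the pull request:

```python
import torch

@torch.no_grad()
def eval_restricted(net, loader, indices_in_1k, device="cuda"):
    """Old eval.py-style grading: drop the 800 out-of-subset logits,
    then take the argmax over the remaining 200."""
    net.eval()
    correct, total = 0, 0
    for data, target in loader:                     # target in 0..199
        data, target = data.to(device), target.to(device)
        pred = net(data)[:, indices_in_1k].argmax(dim=1)
        correct += pred.eq(target).sum().item()
        total += target.size(0)
    return correct / total

@torch.no_grad()
def eval_full_argmax(net, loader, indices_in_1k, device="cuda"):
    """Proposed grading: argmax over all 1,000 logits, then check whether
    the predicted ImageNet-1K class maps to the ImageNet-A target."""
    net.eval()
    # map each ImageNet-1K index to its 0..199 subset label, or -1 if outside the subset
    map_1k_to_200 = torch.full((1000,), -1, dtype=torch.long, device=device)
    map_1k_to_200[torch.as_tensor(indices_in_1k, device=device)] = torch.arange(
        len(indices_in_1k), device=device)
    correct, total = 0, 0
    for data, target in loader:                     # target in 0..199
        data, target = data.to(device), target.to(device)
        pred_1k = net(data).argmax(dim=1)           # argmax over all 1,000 classes
        pred = map_1k_to_200[pred_1k]               # -1 means the model predicted outside the subset
        correct += pred.eq(target).sum().item()
        total += target.size(0)
    return correct / total
```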

Result: The old eval.py significantly over-estimates the robust accuracy. For example, Table 6 of the DINOv2 paper (https://arxiv.org/pdf/2304.07193.pdf) reports an IN-A accuracy of 75.9 using the old eval.py protocol; if the same model is evaluated with the corrected version, the number drops to roughly 53%.

Impact: Considering how heavily the original paper is cited, I think this issue needs to be broadcast and clarified in the community, so that researchers are aware that all previous claims based on ImageNet-A evaluation are very likely over-optimistic.

zhulinchng commented 5 months ago

@HengyueL I think that because ImageNet-A only has 200 classes, it is better to compare only against the relevant classes predicted by standard classifiers trained on the 1,000 classes.

I believe your version of the evaluation would be best used when:

  1. The classifiers were trained only on the 200 classes from ImageNet-1K; or
  2. The ImageNet-A dataset had the same 1,000 classes as ImageNet-1K.

xksteven commented 3 months ago

Thanks for the PR, but @zhulinchng's statements reflect our original thinking: we didn't want to bias the models or expect them to be calibrated for OOD images. In this way we simply grade their performance against our 200-class subset.

Both versions of the evaluation can be viewed as "correct"; they simply measure slightly different things. We have had discussions with other groups who choose to evaluate it in the way you're proposing.
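To make the difference concrete, here is a tiny synthetic example (the 10-class setup and the numbers are made up for illustration) where the two gradings disagree: the model's top-1 prediction over all classes falls outside the subset, so the full-argmax grading marks the sample wrong while the subset grading marks it correct:

```python
import torch

# toy setup: 10 "ImageNet-1K" classes, of which {2, 5, 7} play the role of the 200-class subset
indices_in_1k = [2, 5, 7]
logits = torch.tensor([[0.1, 0.0, 0.8, 0.0, 0.0, 0.3, 0.0, 0.2, 0.0, 2.0]])  # class 9 wins overall
target = torch.tensor([0])  # subset label 0, i.e. toy class 2

# subset grading: argmax over the three kept logits -> subset label 0 -> counted correct
pred_subset = logits[:, indices_in_1k].argmax(dim=1)

# full-argmax grading: class 9 wins over all classes and is outside the subset -> counted wrong
map_1k = torch.full((10,), -1, dtype=torch.long)
map_1k[torch.tensor(indices_in_1k)] = torch.arange(len(indices_in_1k))
pred_full = map_1k[logits.argmax(dim=1)]

print(pred_subset.eq(target).item())  # True
print(pred_full.eq(target).item())    # False
```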

Thanks again for the PR, but we will not be merging it.