I suspect there is something wrong in your second script (BTW, I'd suggest merging both into one script, which would make differences much easier to catch). I just retried the evaluation of the ResNet50_Weights.IMAGENET1K_V2 weights on ImageNet, and the accuracy is good:
```
(pt) classification git:(main) $ torchrun --nproc_per_node=4 train.py --model resnet50 --test-only --weights ResNet50_Weights.IMAGENET1K_V2
...
Test: [ 0/391] eta: 0:31:35 loss: 0.8399 (0.8399) acc1: 100.0000 (100.0000) acc5: 100.0000 (100.0000) time: 4.8471 data: 2.2821 max mem: 533
Test: [100/391] eta: 0:00:27 loss: 1.2759 (1.2416) acc1: 87.5000 (85.5198) acc5: 96.8750 (97.5557) time: 0.0754 data: 0.0611 max mem: 541
Test: [200/391] eta: 0:00:14 loss: 1.7219 (1.3010) acc1: 71.8750 (83.8464) acc5: 90.6250 (96.6884) time: 0.0500 data: 0.0347 max mem: 541
Test: [300/391] eta: 0:00:06 loss: 1.4619 (1.3754) acc1: 75.0000 (81.7795) acc5: 96.8750 (95.7018) time: 0.0505 data: 0.0350 max mem: 541
Test: Total time: 0:00:25
Test: Acc@1 80.850 Acc@5 95.428
```
I'll close this issue because this is most likely a user issue. @Kunaldawn7 after double-checking and if you're absolutely certain this is a problem with the weights and not with your code, then feel free to re-open. Thanks
@NicolasHug, I believe the issue is not with the IMAGENET1K_V2 weights, but rather with the associated transforms. I am sharing a single file, Inference_v1_v2.py, via a gist showing a clear comparison of the confidence scores when performing inference on the same test image with the IMAGENET1K_V1 and IMAGENET1K_V2 weights.
The Torchvision documentation for IMAGENET1K_V2 states:
> The inference transforms are available at ResNet50_Weights.IMAGENET1K_V2.transforms and perform the following preprocessing operations: Accepts PIL.Image, batched (B, C, H, W) and single (C, H, W) image torch.Tensor objects. The images are resized to resize_size=[232] using interpolation=InterpolationMode.BILINEAR, followed by a central crop of crop_size=[224]. Finally the values are first rescaled to [0.0, 1.0] and then normalized using mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225].
So even if I use the following transforms on the test image:
```python
from torchvision import transforms

transforms_V2 = transforms.Compose([
    transforms.Resize(232),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
```
The results are still the same!
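For reference, the bundled V2 preprocessing can also be inspected directly (assuming torchvision >= 0.13, where the weights enums were introduced):

```python
from torchvision.models import ResNet50_Weights

# Inspect the inference-time preprocessing bundled with the V2 weights.
preprocess = ResNet50_Weights.IMAGENET1K_V2.transforms()
print(preprocess)
# Prints an ImageClassification transform with resize_size=[232],
# crop_size=[224], InterpolationMode.BILINEAR, and the ImageNet
# mean/std -- i.e., matching the manual Compose above.
```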
I have a notebook where I have performed inference on other test images, and the difference in the results between the V1 and V2 weights is significant.
Please acknowledge. Let me know if I need to open a new issue for this (I can't re-open this one since I am not a contributor).
Thanks for the simpler reproduction example, @Kunaldawn7. I can reproduce the difference in probability output for that specific image. I think I now better understand what your original concern was.
> This is odd since the V2 version was expected to give a better confidence score!
I don't think we should expect better confidence; we should expect better top-1 or top-5 accuracy (which is the case), but we shouldn't expect the V2 versions to be more confident than V1 in general. In particular, we shouldn't expect the V2 logits or probabilities to be any more interpretable than V1's.
I think what you're observing here is that the V1 and V2 weights are calibrated very differently (calibration in the statistical sense: how closely the predicted probabilities track the observed accuracy). It's an interesting finding and I wasn't aware of it. If I had to bet, I would assume it is due to the auto-augmentation routines used for V2, which strongly regularise the model, leading to possibly lower confidence but better accuracy.
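If anyone wants to quantify this calibration gap, expected calibration error (ECE) is the usual metric. A minimal sketch (my own, not part of the reference scripts), assuming you have already collected top-1 confidences and correctness flags over a validation set:

```python
import torch

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: bin-size-weighted mean of |accuracy - confidence|.

    confidences: (N,) tensor of top-1 softmax probabilities
    correct:     (N,) bool tensor, True where the top-1 prediction was right
    """
    edges = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = (correct[in_bin].float().mean() - confidences[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap
    return ece.item()

# Usage: ece_v1 = expected_calibration_error(conf_v1, correct_v1)
# A well-calibrated model has an ECE close to 0.
```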
But I don't think this is an issue in itself: we don't claim anything w.r.t. confidence scores; the only claims we make concern the accuracies. Hope this helps. I'd be curious to know whether having lower confidence scores is an issue for your use-case.
Thanks @NicolasHug for the super quick response. I understand there would undoubtedly be differences in the probability scores with a strongly regularized model. However, if you notice, the probability differences between V1 and V2 are massive!
To verify, I ran inference on a batch of images with both sets of weights. Here is a summary of the highest probability score (as a percentage) from each model on each image:
Image Filename | V1 | V2 |
---|---|---|
Grosser_Panda.JPG | 99.771% | 58.404% |
boxer_tiger_cat.png | 48.925% | 33.451% |
clownfish.png | 92.497% | 43.483% |
tiger.jpg | 89.430% | 50.925% |
turtle.png | 95.560% | 31.731% |
You can see that there is a vast difference in the values above.
I am planning to present an inference notebook to my audience to show the difference in the probability scores between V1 and V2, without any training involved. But with such differences in the results, I find it difficult to conclude anything.
I am sharing the script of my experiment mentioned above. Please have a look.
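In outline, the comparison boils down to a loop like the following (a simplified sketch, not the exact gist contents; the filenames are the test images from the table above):

```python
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

images = ["Grosser_Panda.JPG", "boxer_tiger_cat.png", "clownfish.png",
          "tiger.jpg", "turtle.png"]

for weights in (ResNet50_Weights.IMAGENET1K_V1, ResNet50_Weights.IMAGENET1K_V2):
    model = resnet50(weights=weights).eval()
    preprocess = weights.transforms()  # the bundled inference transforms
    for name in images:
        batch = preprocess(Image.open(name).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            prob, idx = model(batch).softmax(dim=1).max(dim=1)
        print(f"{weights.name} {name}: "
              f"{weights.meta['categories'][idx.item()]} ({prob.item():.3%})")
```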
NOTE: Interestingly, I applied the following transforms for V2 instead of the default ones:
```python
from torchvision import transforms

transforms_v2 = transforms.Compose([
    transforms.Resize(300),
    transforms.CenterCrop(176),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
```
I used the train-crop-size (176) as part of the validation transforms. The following table shows the resulting highest scores:
Image Filename | V2 |
---|---|
Grosser_Panda.JPG | 77.785% |
boxer_tiger_cat.png | 51.465% |
clownfish.png | 73.963% |
tiger.jpg | 74.962% |
turtle.png | 64.605% |
The scores seem to have improved. Do you have any idea why the results are so sensitive to the choice of transforms?
To reiterate my message from above: I don't believe that the confidence score being lower on average is an issue in and of itself. There is no guarantee with respect to how well these classifiers are calibrated. A priori, we cannot conclude anything from the observation that the predicted probability of one model is higher (or lower) than the same prediction from another model.
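If you do need calibrated probabilities for your presentation, post-hoc temperature scaling (Guo et al., 2017) is the standard remedy. A minimal sketch (my suggestion, not a torchvision utility), assuming you have logits and labels from a held-out labelled set:

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=100):
    """Learn a scalar T > 0 minimizing the NLL of softmax(logits / T).

    logits: (N, C) float tensor from a held-out labelled set
    labels: (N,) long tensor of class indices
    """
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t) stays positive
    optimizer = torch.optim.LBFGS([log_t], max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels)
#        calibrated = (test_logits / T).softmax(dim=1)
```

Dividing the logits by T leaves the argmax (and hence the accuracy) unchanged; it only rescales the confidence.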
🐛 Describe the bug
I performed a simple classification inference on a sample image using ResNet-50, with both the IMAGENET1K_V1 and IMAGENET1K_V2 versions of the model weights. I found around a 70% jump in the final confidence score (even after applying the appropriate transforms) with the V1 version as opposed to V2. This is odd since the V2 version was expected to give a better confidence score!

I have put the two scripts, inference_V1.py and inference_V2.py, in a gist. Please have a look!

The highest confidence score on the same image is 99.771% with the IMAGENET1K_V1 weights and 58.404% with the IMAGENET1K_V2 weights (even with the appropriate transforms). This is quite odd!

The custom transforms for the IMAGENET1K_V1 weights were:

The custom transforms for the IMAGENET1K_V2 weights were:

Even ResNet50_Weights.IMAGENET1K_V1.transforms() and ResNet50_Weights.DEFAULT.transforms() were used, with literally no difference in the results!

PS: I have also attached the Colab notebook for reference.

Can someone point out what the issue is here?
Versions