pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Significant Differences in the Confidence scores across IMAGENET1K_V1 and IMAGENET1K_V2 pre-trained models during Inference #7570

Closed Kunaldawn7 closed 1 year ago

Kunaldawn7 commented 1 year ago

πŸ› Describe the bug

I performed a simple classification inference on a sample image using ResNet-50, with both the IMAGENET1K_V1 and IMAGENET1K_V2 model weights. Even after applying the appropriate transforms for each, the top confidence score from the V1 weights was roughly 70% higher (in relative terms) than from V2.

This is odd since the V2 version was expected to give a better confidence score!

I have put the two scripts, inference_V1.py and inference_V2.py, in a gist. Please have a look!

On the same image, the confidence score of the top predicted class is 99.771% with the IMAGENET1K_V1 weights but only 58.404% with IMAGENET1K_V2 (even with the appropriate transforms). This is quite odd!
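For concreteness, the comparison boils down to something like the minimal sketch below (hypothetical, not the exact gist scripts; it assumes torchvision >= 0.13 and a local test image such as Grosser_Panda.JPG):

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

img = Image.open("Grosser_Panda.JPG").convert("RGB")  # hypothetical local path

for weights in (ResNet50_Weights.IMAGENET1K_V1, ResNet50_Weights.IMAGENET1K_V2):
    model = resnet50(weights=weights).eval()
    batch = weights.transforms()(img).unsqueeze(0)  # bundled inference preset
    with torch.inference_mode():
        probs = model(batch).softmax(dim=1)
    score, idx = probs.max(dim=1)
    print(weights, weights.meta["categories"][idx.item()], f"{score.item():.3%}")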

PS: I have also attached the Colab Notebook for reference.

Can someone point out what the issue is here?

Versions

PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   39 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          20
On-line CPU(s) list:             0-19
Vendor ID:                       GenuineIntel
Model name:                      12th Gen Intel(R) Core(TM) i7-12700H
CPU family:                      6
Model:                           154
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       1
Stepping:                        3
CPU max MHz:                     4700.0000
CPU min MHz:                     400.0000
BogoMIPS:                        5376.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       544 KiB (14 instances)
L1i cache:                       704 KiB (14 instances)
L2 cache:                        11.5 MiB (8 instances)
L3 cache:                        24 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-19
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torchinfo==1.7.2
[pip3] torchmetrics==0.11.4
[pip3] torchvision==0.15.0
[pip3] torchviz==0.0.2
[pip3] triton==2.0.0
[conda] blas                      1.0                         mkl  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0           py310h7f8727e_0  
[conda] mkl_fft                   1.3.1           py310hd6ae3a3_0  
[conda] mkl_random                1.2.2           py310h00e6091_0  
[conda] numpy                     1.23.5          py310hd5efca6_0  
[conda] numpy-base                1.23.5          py310h8e6c178_0  
[conda] pytorch                   2.0.0           py3.10_cuda11.7_cudnn8.5.0_0    pytorch
[conda] pytorch-cuda              11.7                 h778d358_3    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.0.0               py310_cu117    pytorch
[conda] torchinfo                 1.7.2                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtriton               2.0.0                     py310    pytorch
[conda] torchvision               0.15.0              py310_cu117    pytorch
[conda] torchviz                  0.0.2                    pypi_0    pypi
NicolasHug commented 1 year ago

I suspect there is something wrong in your second script (by the way, I'd suggest merging both into one script, which would make differences much easier to catch). I just re-ran the evaluation of the ResNet50_Weights.IMAGENET1K_V2 weights on ImageNet and the accuracies are as expected:

(pt) ➜  classification git:(main) ✗ torchrun --nproc_per_node=4 train.py --model resnet50 --test-only --weights ResNet50_Weights.IMAGENET1K_V2
...
Test:   [  0/391]  eta: 0:31:35  loss: 0.8399 (0.8399)  acc1: 100.0000 (100.0000)  acc5: 100.0000 (100.0000)  time: 4.8471  data: 2.2821  max mem: 533
Test:   [100/391]  eta: 0:00:27  loss: 1.2759 (1.2416)  acc1: 87.5000 (85.5198)  acc5: 96.8750 (97.5557)  time: 0.0754  data: 0.0611  max mem: 541
Test:   [200/391]  eta: 0:00:14  loss: 1.7219 (1.3010)  acc1: 71.8750 (83.8464)  acc5: 90.6250 (96.6884)  time: 0.0500  data: 0.0347  max mem: 541
Test:   [300/391]  eta: 0:00:06  loss: 1.4619 (1.3754)  acc1: 75.0000 (81.7795)  acc5: 96.8750 (95.7018)  time: 0.0505  data: 0.0350  max mem: 541
Test:  Total time: 0:00:25
Test:  Acc@1 80.850 Acc@5 95.428

I'll close this issue because it is most likely a problem on the user side. @Kunaldawn7, if after double-checking you're absolutely certain this is a problem with the weights and not with your code, feel free to re-open. Thanks

Kunaldawn7 commented 1 year ago

@NicolasHug, I believe the issue is not with the IMAGENET1K_V2 weights themselves, but rather with the associated transforms.

I am sharing a single file, Inference_v1_v2.py, via a gist; it directly compares the confidence scores obtained on the same test image with the IMAGENET1K_V1 and IMAGENET1K_V2 weights.

The Torchvision documentation for IMAGENET1K_V2 states:

The inference transforms are available at ResNet50_Weights.IMAGENET1K_V2.transforms and perform the following preprocessing operations: Accepts PIL.Image, batched (B, C, H, W) and single (C, H, W) image torch.Tensor objects. The images are resized to resize_size=[232] using interpolation=InterpolationMode.BILINEAR, followed by a central crop of crop_size=[224]. Finally the values are first rescaled to [0.0, 1.0] and then normalized using mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225].
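As a sanity check, the bundled preset can be inspected directly (a small snippet, assuming torchvision >= 0.13; its printed representation should list these same parameters):

from torchvision.models import ResNet50_Weights

preprocess = ResNet50_Weights.IMAGENET1K_V2.transforms()
print(preprocess)  # expected to show crop_size=[224], resize_size=[232], bilinear interpolation, ImageNet mean/std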

So even when I build the equivalent transforms manually and apply them to the test image:

from torchvision import transforms

transforms_V2 = transforms.Compose([
    transforms.Resize(232),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

The results are still the same!
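One way to rule out a preprocessing mismatch is to compare the tensor from the manual pipeline against the bundled preset. A sketch, assuming transforms_V2 is defined as above and the test image path is hypothetical:

import torch
from PIL import Image
from torchvision.models import ResNet50_Weights

img = Image.open("Grosser_Panda.JPG").convert("RGB")  # hypothetical test image
bundled = ResNet50_Weights.IMAGENET1K_V2.transforms()
# transforms_V2 is the manual pipeline defined above
print(torch.allclose(transforms_V2(img), bundled(img), atol=1e-6))  # True would indicate the pipelines match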

I have a notebook where I have performed inference on other test images as well, and the differences in the results between the V1 and V2 weights are just as large.

Please let me know whether I should open a new issue for this (I can't re-open this one since I am not a contributor).

NicolasHug commented 1 year ago

Thanks for the simpler reproducing example @Kunaldawn7 . I can reproduce the difference in probability output for that specific image. I think I understand better now what your original concern was.

This is odd since the V2 version was expected to give a better confidence score!

I don't think we should expect better confidence; we should expect better top-1 or top-5 accuracy (which is the case), but we shouldn't expect the V2 weights to be more confident than V1 in general. In particular, we shouldn't expect the V2 logits or probabilities to be any more interpretable than V1's.

I think that what you're observing here is that the V1 and V2 weights are calibrated very differently (calibration in the statistical sense: how well the predicted probabilities reflect the actual likelihood of being correct). It's an interesting finding and I wasn't aware of it. If I had to bet, I would assume it is due to the auto-augmentation routines used in the V2 training recipe, which strongly regularise the model, leading to possibly lower confidence but better accuracy.

But I don't think this is an issue in itself: we don't claim anything w.r.t. confidence scores; the only claims we make concern the accuracies. Hope this helps. I'd be curious to know whether having lower confidence scores is an issue for your use-case.
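As a toy illustration of that point (made-up logits, not tied to either set of weights): dividing a model's logits by any positive temperature leaves the predicted class, and therefore the top-1 accuracy, unchanged, while the softmax confidence can shrink arbitrarily.

import torch

logits = torch.tensor([[8.0, 2.0, 1.0]])  # made-up logits for a 3-class toy problem
for temperature in (1.0, 2.0, 4.0):
    probs = (logits / temperature).softmax(dim=1)
    # the argmax (and thus the accuracy) never changes, only the confidence does
    print(temperature, probs.argmax(dim=1).item(), round(probs.max().item(), 3))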

Kunaldawn7 commented 1 year ago

Thanks @NicolasHug for the super quick response. I understand there will inevitably be some differences in the probability scores with a strongly regularized model. However, if you look at the numbers, the probability differences between V1 and V2 are massive!

To check this, I ran inference on a batch of images with both sets of weights. Here is a summary of the highest probability score (as a percentage) for each model on each image:

Image Filename         V1        V2
Grosser_Panda.JPG      99.771%   58.404%
boxer_tiger_cat.png    48.925%   33.451%
clownfish.png          92.497%   43.483%
tiger.jpg              89.430%   50.925%
turtle.png             95.560%   31.731%

You can see that there is a vast difference in the values above.

I am planning to present an inference notebook to my audience to show the difference in the probability scores between V1 and V2, without any training involved. But with such large differences in the results, I find it difficult to draw any conclusions.

I am sharing the script of my experiment mentioned above. Please have a look.

NOTE:

Interestingly, I applied the following transforms for V2 instead of the default ones:

transforms_v2 = transforms.Compose([
    transforms.Resize(300),
    transforms.CenterCrop(176),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

I used the training crop size (176) as part of the validation transforms. The following table shows the resulting highest scores:

Image Filename         V2
Grosser_Panda.JPG      77.785%
boxer_tiger_cat.png    51.465%
clownfish.png          73.963%
tiger.jpg              74.962%
turtle.png             64.605%

The scores seem to have improved. Do you have any idea why the results aren't consistent across these preprocessing choices?

NicolasHug commented 1 year ago

To reiterate my message from above: I don't believe that the confidence scores being lower on average is an issue in and of itself. There is no guarantee with respect to how well these classifiers are calibrated. A priori, we cannot conclude anything from the observation that the predicted probability of one model is higher (or lower) than that of another model on the same input.