openai / CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
MIT License
26.28k stars 3.35k forks

Reproduce zeroshot results on EuroSAT dataset. #166

Open xcpeng opened 3 years ago

xcpeng commented 3 years ago

Thank you for your work on CLIP!

I was trying to reproduce the zero-shot prediction results listed in Table 11 of the paper. Although I can reproduce most of the results in Table 11, I found large gaps on the EuroSAT dataset.

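We have tried the following, with results in the tables below: loading the model with and without JIT, preprocessing with and without center crop, and checking the order of categories.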

But we still cannot reproduce the numbers reported in Table 11. Any hints would be greatly appreciated, thank you!

| Model (EuroSAT) | CLIP | Ours | Delta |
|---|---|---|---|
| ResNet50 | 41.1 | 41.3 | 0.2 |
| ResNet101 | 33.1 | 31.0 | -2.1 |
| RN50x4 | 35.0 | 32.7 | -2.3 |
| RN50x16 | 40.3 | 42.0 | 1.7 |
| ViT-B/16 | 54.1 | 54.6 | 0.5 |
| ViT-B/32 | 49.4 | 44.8 | -4.6 |
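
To make the size of each gap explicit, here is a minimal sketch that recomputes Delta as "Ours" minus "CLIP" from the numbers above and lists the models whose gap exceeds a threshold. The 1.0-point `tolerance` and the helper names are assumptions chosen for illustration, not something defined in the paper or the repository.

```python
# EuroSAT zero-shot accuracy (%) copied from the table above: (CLIP paper, ours).
results = {
    "ResNet50": (41.1, 41.3),
    "ResNet101": (33.1, 31.0),
    "RN50x4": (35.0, 32.7),
    "RN50x16": (40.3, 42.0),
    "ViT-B/16": (54.1, 54.6),
    "ViT-B/32": (49.4, 44.8),
}


def deltas(results):
    """Return ours - paper for every model, rounded to one decimal."""
    return {model: round(ours - paper, 1) for model, (paper, ours) in results.items()}


def flagged(results, tolerance=1.0):
    """List models whose absolute gap exceeds `tolerance` (hypothetical threshold)."""
    return [model for model, d in deltas(results).items() if abs(d) > tolerance]


print(deltas(results))
print(flagged(results))
```

With this threshold, ResNet50 and ViT-B/16 count as reproduced, and the other four models are flagged.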

JIT applied or not when loading the CLIP model

| Model (EuroSAT) | w/ JIT | w/o JIT | Delta |
|---|---|---|---|
| ResNet50 | 41.3 | 41.3 | 0 |
| ResNet101 | 31.0 | 31.0 | 0 |
| RN50x4 | 32.6 | 32.7 | -0.1 |
| RN50x16 | 42.2 | 42.0 | 0.2 |
| ViT-B/16 | 54.6 | 54.6 | 0 |
| ViT-B/32 | 44.8 | 44.8 | 0 |
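
For anyone repeating this comparison, here is a rough sketch of how the two variants might be loaded through the `jit` argument of `clip.load`, plus a small helper that summarizes the gap between paired runs. The function names `load_clip` and `max_abs_gap` are made up for this example, and the evaluation loop is omitted, so this is not necessarily how the numbers above were produced.

```python
def load_clip(name, jit, device="cpu"):
    """Load one CLIP checkpoint with TorchScript JIT on or off."""
    # Imported lazily so the helper below can be used without the package installed.
    import clip  # the openai/CLIP package

    # clip.load returns the model and its matching image preprocessing transform.
    model, preprocess = clip.load(name, device=device, jit=jit)
    return model, preprocess


def max_abs_gap(with_jit, without_jit):
    """Largest absolute accuracy difference between paired runs."""
    return max(abs(a - b) for a, b in zip(with_jit, without_jit))


# EuroSAT accuracies (%) from the table above, in the same model order.
with_jit = [41.3, 31.0, 32.6, 42.2, 54.6, 44.8]
without_jit = [41.3, 31.0, 32.7, 42.0, 54.6, 44.8]
print(round(max_abs_gap(with_jit, without_jit), 1))
```

A call such as `load_clip("ViT-B/32", jit=False)` downloads the weights on first use. The largest difference in the table is 0.2 points, so the JIT setting does not seem to account for the gaps against Table 11.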

Image preprocessing: center crop vs. no center crop

| Model (EuroSAT) | w/ Crop | w/o Crop | Delta |
|---|---|---|---|
| ResNet50 | 41.3 | 41.3 | 0 |
| ResNet101 | 31.0 | 31.0 | 0 |
| RN50x4 | 32.7 | 32.7 | 0 |
| RN50x16 | 42.0 | 42.0 | 0 |
| ViT-B/16 | 54.6 | 54.6 | 0 |
| ViT-B/32 | 44.8 | 44.8 | 0 |
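
One possible reading of the all-zero deltas is sketched below in plain Python. It assumes the preprocessing first resizes the shorter side to the model's input resolution and then center-crops to that resolution, and that EuroSAT tiles are square 64x64 images. Under those assumptions the crop covers the whole resized image and changes nothing. The helper names are hypothetical, and the integer rounding only approximates what an image library would do.

```python
def resize_shorter_side(width, height, n_px):
    """Output size after scaling the shorter side to n_px, keeping aspect ratio."""
    if width <= height:
        return n_px, int(n_px * height / width)
    return int(n_px * width / height), n_px


def center_crop_box(width, height, n_px):
    """(left, top, right, bottom) of a centered n_px x n_px crop."""
    left = (width - n_px) // 2
    top = (height - n_px) // 2
    return left, top, left + n_px, top + n_px


def crop_is_noop(width, height, n_px=224):
    """True when the center crop covers the whole resized image."""
    w, h = resize_shorter_side(width, height, n_px)
    return center_crop_box(w, h, n_px) == (0, 0, w, h)


print(crop_is_noop(64, 64))    # square tile, assumed EuroSAT size
print(crop_is_noop(640, 480))  # non-square image, where the crop trims the sides
```

The same holds for any input resolution `n_px`, since a square image stays square after the resize.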

Order of categories

| Dataset | Order Inconsistent | Order Fixed |
|---|---|---|
| EuroSAT | Y | Y |
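
Category order matters because a zero-shot prediction is an index into the list of class names used to build the text prompts, while the ground-truth label is an index into the dataset's own class list. The toy sketch below uses made-up class names and predictions to show how a mismatch between the two orders can drive accuracy to zero even when every prediction names the right class. It illustrates the general pitfall only and does not diagnose any specific number in this thread.

```python
def accuracy(pred_indices, label_indices):
    """Fraction of positions where prediction and label agree."""
    hits = sum(p == y for p, y in zip(pred_indices, label_indices))
    return hits / len(label_indices)


def remap_predictions(pred_indices, prompt_classes, dataset_classes):
    """Translate indices into prompt_classes into indices into dataset_classes."""
    to_dataset = {name: i for i, name in enumerate(dataset_classes)}
    return [to_dataset[prompt_classes[p]] for p in pred_indices]


# Hypothetical three-class example.
dataset_classes = ["forest", "highway", "river"]  # order behind the label indices
prompt_classes = ["river", "forest", "highway"]   # order used to build text prompts
labels = [0, 1, 2, 0]  # ground truth, indexed by dataset_classes
preds = [1, 2, 0, 1]   # correct classes, but indexed by prompt_classes

print(accuracy(preds, labels))
print(accuracy(remap_predictions(preds, prompt_classes, dataset_classes), labels))
```

Building the prompts directly from the dataset's class list, in its own order, avoids the remapping step altogether.
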

AnhLee081198 commented 1 year ago

Hi @xcpeng, can you share the code you used to get these numbers? With my code I only get 4.42 with ViT-B/32 on EuroSAT, while the other datasets match Table 11.