openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

Implementation details in few-shot ImageNet evaluation #135

Open · TonyLianLong opened this issue 3 years ago

TonyLianLong commented 3 years ago

Thanks for the amazing paper! I have a few questions about the details of the results in Figure 6 of the paper. I tried to use the linear probing code from the README.md (shown there on CIFAR) to run a one-shot evaluation on ImageNet. However, my evaluation implementation with ViT-B/32 only gets 27% accuracy on the ImageNet validation set (not ImageNet V2), instead of the roughly 45% reported in the paper. I therefore suspect that I missed some details of the paper's experiments, and I have a few questions:

  1. Which CLIP model does Figure 6 use? The zero-shot accuracy makes me think it's ViT-B/32; however, in my own evaluation, ViT-B/32 does not reach 45% one-shot accuracy.
  2. Could you disclose the C parameter used in sklearn for this experiment? Although the performance does not vary much across different C values on my end, I would appreciate it if this value could be provided.
  3. How are the instances per class selected? I select the first sample of each class in the ImageNet training set, which yields a dataset of 1,000 samples, each with a 768-dimensional feature (taken before the final projection to 512 dimensions). Does my selection approach sound reasonable? (A sketch of my evaluation pipeline is below.)
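
For concreteness, here is roughly the pipeline I used. It is a minimal sketch based on the README's linear-probe example, not a faithful reproduction of the paper's setup: the C value is the README's CIFAR-100 setting, the ImageNet path is a placeholder, and `encode_image` returns the 512-dimensional projected features rather than the 768-dimensional pre-projection features mentioned above (those would require a hook on the visual backbone).

```python
import numpy as np
import torch
import clip
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import ImageNet
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

train = ImageNet(root="/path/to/imagenet", split="train", transform=preprocess)
val = ImageNet(root="/path/to/imagenet", split="val", transform=preprocess)

# One-shot subset: the first training example of each of the 1000 classes
# (the selection rule from question 3).
seen, indices = set(), []
for idx, label in enumerate(train.targets):
    if label not in seen:
        seen.add(label)
        indices.append(idx)
train_subset = Subset(train, indices)

@torch.no_grad()
def get_features(dataset):
    # encode_image returns the projected 512-dim features for ViT-B/32.
    feats, labels = [], []
    for images, targets in DataLoader(dataset, batch_size=256, num_workers=8):
        feats.append(model.encode_image(images.to(device)).float().cpu())
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

train_x, train_y = get_features(train_subset)
val_x, val_y = get_features(val)

# C=0.316 is the README's CIFAR-100 value; the paper sweeps C per task (question 2).
clf = LogisticRegression(C=0.316, max_iter=1000)
clf.fit(train_x, train_y)
print("one-shot top-1:", np.mean(clf.predict(val_x) == val_y))
```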

Thanks again for the amazing paper and thanks in advance for helping me.

realTaki commented 3 years ago

I tried running the test on CIFAR100 following the README.md (Zero-Shot Prediction) and also can't reach the performance reported in the paper. Did you solve this problem?
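
For reference, this is roughly what I ran: a minimal sketch that extends the README's zero-shot CIFAR-100 example over the whole test set. The single "a photo of a {c}" prompt and the batch size come from the README, not necessarily from the paper's setup.

```python
import os
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True,
                    train=False, transform=preprocess)
loader = DataLoader(cifar100, batch_size=256)

# One text embedding per class, using the README's single prompt template.
text_inputs = torch.cat(
    [clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]
).to(device)

correct = total = 0
with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    for images, labels in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"zero-shot top-1: {correct / total:.4f}")
```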

shyammarjit commented 1 year ago

Please refer to https://github.com/openai/CLIP/blob/fcab8b6eb92af684e7ff0a904464be7b99b49b88/notebooks/Prompt_Engineering_for_ImageNet.ipynb for this concern.
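
The core technique in that notebook is prompt ensembling: each class name is encoded under many prompt templates, and the normalized text embeddings are averaged into a single classifier weight per class. Below is a minimal sketch of the idea with only two illustrative templates; the notebook defines the full template set used for ImageNet.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative templates only; the notebook lists the full ImageNet set.
templates = ["a photo of a {}.", "a photo of the {}."]

@torch.no_grad()
def zeroshot_classifier(classnames):
    weights = []
    for name in classnames:
        texts = clip.tokenize([t.format(name) for t in templates]).to(device)
        embeddings = model.encode_text(texts)
        embeddings /= embeddings.norm(dim=-1, keepdim=True)
        mean = embeddings.mean(dim=0)            # average over templates
        weights.append(mean / mean.norm())       # renormalize the ensemble
    return torch.stack(weights, dim=1)           # (embed_dim, num_classes)

# Usage, as in the notebook: logits = 100.0 * image_features @ zeroshot_classifier(classnames),
# with image_features L2-normalized.
```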