openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

Zeroshot evaluation CIFAR100 in Colab: Results ~0.5% below values reported in paper #204


andsteing commented 2 years ago

Hi

Thanks so much for providing this repository and the notebooks!

I'm debugging differences in zero-shot evaluation results between this repository and its JAX port (scenic.projects.baselines.clip), and as part of this work I'm trying to reproduce the exact numbers published in the paper (https://arxiv.org/abs/2103.00020, Table 11).

I created a short Colab based on the provided notebooks where I'm zero-shot evaluating CLIP models on the CIFAR100 dataset: https://colab.research.google.com/github/andsteing/CLIP/blob/zeroshot/notebooks/zeroshot_evaluation.ipynb#scrollTo=Mo-MYo3Flgth

I get the following results:

| model | dataset  | 7 prompts | 80 prompts | Table 11 |
|-------|----------|-----------|------------|----------|
| RN50  | CIFAR100 | 40.93     | 41.04      | 41.6     |
| B/32  | CIFAR100 | 64.58     | 64.21      | 65.1     |
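For reference, the evaluation in the Colab follows the notebooks' prompt-ensembling recipe: normalize the text embedding of each template, average over templates per class, re-normalize, then classify images by cosine similarity. A minimal sketch of that logic, with random arrays standing in for real CLIP embeddings (`embed_dim` and the array shapes are placeholders):

```python
import numpy as np

def build_zeroshot_weights(text_embeds):
    """text_embeds: (num_classes, num_templates, dim) raw text embeddings.
    Normalize each template embedding, average over templates, re-normalize."""
    w = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    w = w.mean(axis=1)
    w = w / np.linalg.norm(w, axis=-1, keepdims=True)
    return w  # (num_classes, dim), each row a unit vector

def zeroshot_predict(image_embeds, class_weights):
    """Cosine-similarity classification: argmax over class weight vectors."""
    img = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    logits = img @ class_weights.T  # (num_images, num_classes)
    return logits.argmax(axis=-1)

# Placeholder data: 100 classes x 80 templates x 512-dim embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(100, 80, 512))
images = rng.normal(size=(16, 512))
W = build_zeroshot_weights(text)
preds = zeroshot_predict(images, W)
```

With real embeddings, `text` would come from `model.encode_text` on the tokenized prompts and `images` from `model.encode_image` on the CIFAR100 test set.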

So the results are still about 0.5% short of what I would expect after reading the paper.

Any idea what this small difference could be due to?

Best, Andreas

jongwook commented 2 years ago

Hi,

There can be numerical differences that we cannot fully control, e.g. different CUDA and driver versions, batch sizes, hardware, etc., that may cause the 0.5% difference in evals.

That being said, have you tried with the 18 prompts in this document?
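Switching to a different template set only requires re-expanding the prompts per class before tokenization. An illustrative sketch (the template strings below are examples, not the actual set from the repository's prompts document):

```python
# Example templates only; the real set lives in the repository's prompts document.
templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a low contrast photo of a {}.",
]

def expand_prompts(classnames, templates):
    """One prompt string per (class, template) pair, ready for clip.tokenize."""
    return {name: [t.format(name) for t in templates] for name in classnames}

prompts = expand_prompts(["apple", "bicycle"], templates)
```

Each class's list of prompts is then tokenized and encoded, and the resulting embeddings are ensembled as in the notebooks.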

Calmepro777 commented 1 year ago

> Hi,
>
> There can be numerical differences that we cannot fully control, e.g. different CUDA and driver versions, batch sizes, hardware, etc., that may cause the 0.5% difference in evals.
>
> That being said, have you tried with the 18 prompts in this document?

Hello,

I would be grateful if the prompts and other settings used for the zero-shot results on CIFAR-100 and CIFAR-10 could be shared.

In research, we would like to know whether our experimental settings are correct and optimal; otherwise reviewers could challenge them.

Regards

Calmepro777 commented 1 year ago

Here are the best results I obtained:

image encoder: ViT-B/16
prompt: "itap of a {label}."

| Dataset   | Reproduced Acc. | Reported Acc. | Gap  |
|-----------|-----------------|---------------|------|
| CIFAR-10  | 90.51           | 91.6          | 1.09 |
| CIFAR-100 | 68.03           | 68.7          | 0.67 |
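For completeness, the accuracy figures above are standard top-1 accuracy: the fraction of test images whose highest-scoring class matches the true label. A small sketch (the logits here are made-up placeholders, not real model outputs):

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Percentage of rows whose argmax matches the true label."""
    preds = np.asarray(logits).argmax(axis=-1)
    return 100.0 * (preds == np.asarray(labels)).mean()

# Four images, three classes; true labels are 1, 0, 2, 0.
logits = np.array([[0.2, 0.7, 0.1],
                   [0.9, 0.05, 0.05],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.4, 0.3]])
print(top1_accuracy(logits, [1, 0, 2, 0]))  # → 75.0
```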