openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

the acc result on the CIFAR100 dataset #153

Open realTaki opened 3 years ago

realTaki commented 3 years ago

I used the code in the README.md (Zero-Shot Prediction) to test the accuracy of ViT-B/32 on the CIFAR100 dataset and got about 62% top-1. That is less than the 65.1% reported in the paper. I was wondering how the paper measures the accuracy of a zero-shot task. Can you give some details to help me reproduce the paper's result, or do you have any idea how to troubleshoot this?

jongwook commented 3 years ago

Hi, see https://github.com/openai/CLIP/blob/main/data/prompts.md#cifar100 where you can now find the class names and prompts for ensembling zero-shot predictions for CIFAR100.
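The gap between the README snippet (a single "a photo of a {}." prompt per class) and the paper's number mostly comes from this prompt ensembling. A minimal sketch of the recipe, assuming the `clip` package and an abbreviated `templates` list (the full CIFAR100 template list is in data/prompts.md):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Abbreviated example templates; see data/prompts.md for the full CIFAR100 list.
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a low contrast photo of a {}."]

def build_zeroshot_classifier(classnames):
    with torch.no_grad():
        weights = []
        for name in classnames:
            texts = clip.tokenize([t.format(name) for t in templates]).to(device)
            embs = model.encode_text(texts)
            embs = embs / embs.norm(dim=-1, keepdim=True)  # normalize each prompt embedding
            mean = embs.mean(dim=0)                        # ensemble by averaging
            weights.append(mean / mean.norm())             # renormalize the average
    return torch.stack(weights, dim=1)                     # (embed_dim, num_classes)

# Then score a batch with: logits = 100.0 * image_features @ build_zeroshot_classifier(cifar100.classes)
```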

weiyx16 commented 2 years ago

> I used the code in the README.md (Zero-Shot Prediction) to test the accuracy of ViT-B/32 on the CIFAR100 dataset and got about 62% top-1. That is less than the 65.1% reported in the paper. I was wondering how the paper measures the accuracy of a zero-shot task. Can you give some details to help me reproduce the paper's result, or do you have any idea how to troubleshoot this?

Hello! Have you tried to reproduce the results on VOC2007? I got only 71% mAP with ViT-B/32 using the official class names and prompts, which is way below the reported 83.1%. Do you have any suggestions? Thank you very much for your help!

xcpeng commented 2 years ago

> I used the code in the README.md (Zero-Shot Prediction) to test the accuracy of ViT-B/32 on the CIFAR100 dataset and got about 62% top-1. That is less than the 65.1% reported in the paper. I was wondering how the paper measures the accuracy of a zero-shot task. Can you give some details to help me reproduce the paper's result, or do you have any idea how to troubleshoot this?

> Hello! Have you tried to reproduce the results on VOC2007? I got only 71% mAP with ViT-B/32 using the official class names and prompts, which is way below the reported 83.1%. Do you have any suggestions? Thank you very much for your help!

Have you checked the order of the categories in VOC against the order in the prompts? FYI, the default order in the prompts is:

classes = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'dog', 'horse', 'motorbike', 'person', 'sheep', 'sofa', 'diningtable', 'pottedplant', 'train', 'tvmonitor']

In VOC, the order is:

pascalvoc2007_classes = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']

Reorder the prompts to match the VOC order and you will get about 82.5% mAP for ViT-B/32.
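To make the fix concrete, here is a small sketch of the reordering. The two lists are copied from the comment above; `zeroshot_weights` is a hypothetical name for a text-feature matrix built in prompt order:

```python
# Align the prompt order to the dataset's label order.
classes = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
           'chair', 'cow', 'dog', 'horse', 'motorbike', 'person', 'sheep', 'sofa',
           'diningtable', 'pottedplant', 'train', 'tvmonitor']        # prompt order
pascalvoc2007_classes = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
                         'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
                         'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
                         'train', 'tvmonitor']                        # dataset order

assert sorted(classes) == sorted(pascalvoc2007_classes)
perm = [classes.index(c) for c in pascalvoc2007_classes]
# Apply to a classifier built in prompt order so column i matches VOC label i, e.g.:
#   zeroshot_weights = zeroshot_weights[:, perm]
print(perm[10])  # 16: 'diningtable' sits at prompt index 16 but VOC label index 10
```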

xcpeng commented 2 years ago

By the way, the issue is the index of 'diningtable' in the prompts.

weiyx16 commented 2 years ago

> By the way, the issue is the index of 'diningtable' in the prompts.

Thank you very much for your help!! Actually, I had already noticed the problem with the prompts and fixed it, but I still couldn't reproduce the results. I also need to correct myself: the number I reproduced is 75%, not the 71% mentioned before. The key question that keeps confusing me is how to calculate the 11-point mAP.

Here is how I reproduce the results: https://github.com/weiyx16/vocreproduce/blob/main/reproduce.py Just run reproduce.py and it will download the model and the dataset automatically.
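On the 11-point mAP question: the VOC2007 metric interpolates precision at eleven evenly spaced recall levels and averages them per class. A minimal sketch of that protocol as I read it (my own rendering, not code from this repo):

```python
import numpy as np

def eleven_point_ap(scores, labels):
    """11-point interpolated AP (PASCAL VOC 2007 style) for a single class.

    scores: (N,) per-image confidence for this class
    labels: (N,) binary ground truth (1 if the image contains the class)
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                 # rank images by descending score
    tp = labels[order].astype(float)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / max(tp.sum(), 1.0)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):         # recall levels 0.0, 0.1, ..., 1.0
        mask = recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap

# mAP = mean of eleven_point_ap over the 20 VOC classes.
```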

weiyx16 commented 2 years ago

> By the way, the issue is the index of 'diningtable' in the prompts.

> Thank you very much for your help!! Actually, I had already noticed the problem with the prompts and fixed it, but I still couldn't reproduce the results. I also need to correct myself: the number I reproduced is 75%, not the 71% mentioned before. The key question that keeps confusing me is how to calculate the 11-point mAP.

> Here is how I reproduce the results: https://github.com/weiyx16/vocreproduce/blob/main/reproduce.py Just run reproduce.py and it will download the model and the dataset automatically.

I think I fixed the bug by adding a softmax over the logits. This does not affect the accuracy on other datasets, because the softmax leaves each image's top-1 prediction unchanged, but it does change how images are ranked within each class, which affects the sorting in the mAP calculation.
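That distinction is easy to verify. A tiny sketch of why the softmax changes mAP but not top-1 accuracy:

```python
import torch

logits = torch.randn(100, 20)          # (images, classes) image-text similarities
probs = logits.softmax(dim=-1)         # normalize each image's scores across classes

# Per-image prediction is unchanged: softmax is monotonic within each row.
assert (logits.argmax(dim=-1) == probs.argmax(dim=-1)).all()

# But each row is rescaled by its own normalizer, so the ranking of images
# *within one class column* can change, and that ranking is exactly what
# the per-class AP computation sorts by.
```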

machengcheng2016 commented 2 years ago

@weiyx16 Greetings. Have you ever tried CLIP on the StanfordCars dataset? I can only get ~47%/~48% accuracy without/with the "prompt ensembling" trick, far from the ~55% reported in the original paper. Could you help me with some possible clues, please?

weiyx16 commented 2 years ago

@machengcheng2016 We tried it before and were able to reproduce the zero-shot performance on StanfordCars with the R50 (ResNet-50) backbone (about 55.0% in our experiments). There is nothing special to check. Have you verified the dataset? It has 8,041 test images. And have you reproduced the results on other datasets?
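A quick way to run that sanity check, assuming the torchvision loader (and that its download mirror is still reachable):

```python
from torchvision.datasets import StanfordCars

# The standard StanfordCars split has 8,144 training and 8,041 test images.
test_set = StanfordCars(root="data", split="test", download=True)
assert len(test_set) == 8041, f"unexpected test-set size: {len(test_set)}"
```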

machengcheng2016 commented 2 years ago

I solved my problem. It was the data augmentation that was tripping me up.

QMiao-cs commented 1 year ago

> I used the code in the README.md (Zero-Shot Prediction) to test the accuracy of ViT-B/32 on the CIFAR100 dataset and got about 62% top-1. That is less than the 65.1% reported in the paper. I was wondering how the paper measures the accuracy of a zero-shot task.

I also tested the accuracy of ViT-B/32 on CIFAR100 and got 61.67%. I would like to ask whether you have solved this problem. Can you give some details to reproduce the result?

machengcheng2016 commented 1 year ago

> I used the code in the README.md (Zero-Shot Prediction) to test the accuracy of ViT-B/32 on the CIFAR100 dataset and got about 62% top-1. That is less than the 65.1% reported in the paper. I was wondering how the paper measures the accuracy of a zero-shot task.

> I also tested the accuracy of ViT-B/32 on CIFAR100 and got 61.67%. I would like to ask whether you have solved this problem. Can you give some details to reproduce the result?

It might come from data augmentation. Please make sure you are using the correct one.

QMiao-cs commented 1 year ago

> I used the code in the README.md (Zero-Shot Prediction) to test the accuracy of ViT-B/32 on the CIFAR100 dataset and got about 62% top-1. That is less than the 65.1% reported in the paper. I was wondering how the paper measures the accuracy of a zero-shot task.

> I also tested the accuracy of ViT-B/32 on CIFAR100 and got 61.67%. I would like to ask whether you have solved this problem. Can you give some details to reproduce the result?

> It might come from data augmentation. Please make sure you are using the correct one.

The images are loaded by the following code from the `README.md`:

`cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)`
`image, class_id = cifar100[3637]`

Is there any data augmentation here?

Calmepro777 commented 1 year ago

> I used the code in the README.md (Zero-Shot Prediction) to test the accuracy of ViT-B/32 on the CIFAR100 dataset and got about 62% top-1. That is less than the 65.1% reported in the paper. I was wondering how the paper measures the accuracy of a zero-shot task.

> I also tested the accuracy of ViT-B/32 on CIFAR100 and got 61.67%. I would like to ask whether you have solved this problem. Can you give some details to reproduce the result?

> It might come from data augmentation. Please make sure you are using the correct one.

Hi, I wonder what the correct data augmentation setting would be. I used the standard transform settings on the CIFAR100 validation set, plus resizing to 224.

shyammarjit commented 1 year ago

Please refer to https://github.com/openai/CLIP/blob/fcab8b6eb92af684e7ff0a904464be7b99b49b88/notebooks/Prompt_Engineering_for_ImageNet.ipynb for this concern.
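One way to rule out augmentation issues entirely is to pass the `preprocess` transform returned by `clip.load` straight into the dataset, as the README does, instead of rebuilding the resize/normalize pipeline by hand. A minimal sketch:

```python
import os
import clip
import torch
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# `preprocess` already applies the resize, center crop, and normalization
# statistics the model was trained with; a hand-rolled transform with
# different interpolation or mean/std will cost accuracy.
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True,
                    train=False, transform=preprocess)
image, class_id = cifar100[3637]   # `image` is now a preprocessed tensor
```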

jiachengc commented 5 months ago

Could anyone tell me why the "order in prompts" matters? Thanks in advance.