mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

I can't reproduce Table 6 #29

Closed: shuguang99 closed this issue 1 year ago

shuguang99 commented 1 year ago

Hello, thank you for your excellent and creative work. I would like to know whether Table 6 of the paper reports zero-shot classification performance. When I evaluated your released checkpoint with clip-benchmark, I got different results; why might that be?

I suspect that Table 6 actually reports fine-tuned accuracy on the downstream datasets; is that right?

This is my command:

```
clip_benchmark eval --dataset=imagenet --task=zeroshot_classification \
    --pretrained=./negCLIP.pt --model=ViT-B-32 \
    --output=result.json --batch_size=64
```

This is the comparison of results:

| Dataset  | Your paper | My test |
|----------|------------|---------|
| CIFAR10  | 94.0       | 85.9    |
| CIFAR100 | 79.0       | 60.9    |
| ImageNet | 72.0       | 55.7    |
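
For reference, my understanding of what the zeroshot_classification task computes is roughly the following sketch with open_clip; the class names and the prompt template here are placeholders, not clip_benchmark's actual prompt set:

```python
# Rough sketch of zero-shot classification with open_clip; the class names
# and prompt template are placeholders, not the benchmark's actual prompts.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="./negCLIP.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classnames = ["airplane", "automobile", "bird"]  # placeholder class names
template = "a photo of a {}."                    # placeholder prompt

with torch.no_grad():
    # One text embedding per class: encode the prompt and L2-normalize.
    tokens = tokenizer([template.format(c) for c in classnames])
    text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def predict(images):
    # images: a batch already transformed with `preprocess`, shape (B, 3, 224, 224)
    img_feats = model.encode_image(images)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    return (img_feats @ text_feats.T).argmax(dim=-1)  # predicted class indices
```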

I look forward to your reply. Thank you very much!

vinid commented 1 year ago

Hello!

Table 6 shows linear-probing results, not zero-shot classification. See also here.
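
For context, linear probing (the protocol from the original CLIP paper) freezes the image encoder, extracts features, and fits a logistic-regression classifier on top. A minimal sketch of that protocol is below; it is only an illustration, and the exact feature-extraction and hyperparameter choices behind Table 6 may differ:

```python
# Minimal linear-probe sketch: freeze the CLIP image encoder, extract
# features, and fit logistic regression on top. This illustrates the
# protocol; it is not the exact script behind Table 6.
import torch
import open_clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="./negCLIP.pt"
)
model.eval()

train_set = CIFAR10(root="./data", train=True, download=True, transform=preprocess)
test_set = CIFAR10(root="./data", train=False, download=True, transform=preprocess)

@torch.no_grad()
def extract(dataset):
    # Encode every image with the frozen encoder and L2-normalize features.
    feats, labels = [], []
    for images, ys in DataLoader(dataset, batch_size=64):
        f = model.encode_image(images)
        feats.append(f / f.norm(dim=-1, keepdim=True))
        labels.append(ys)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

X_train, y_train = extract(train_set)
X_test, y_test = extract(test_set)

# C is the inverse regularization strength; CLIP sweeps it on a validation split.
clf = LogisticRegression(C=0.316, max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", clf.score(X_test, y_test))
```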

shuguang99 commented 1 year ago

Thank you for your answer. I have another question: are you sure that batch-size 256 works for CLIP training on a single 2080 Ti GPU? Even with batch-size 32, I use almost all of the GPU memory. This is my command:

```
CUDA_VISIBLE_DEVICES=0 python -m training.main \
    --train-data=train_neg_clip.tsv \
    --batch-size=256 \
    --epochs=5 \
    --name=negclip_256_1e-6 \
    --lr=1e-6 \
    --val-data=valid_neg_clip.tsv \
    --logs="./logs/negCLIP/" \
    --pretrained="openai" \
    --model="ViT-B-32" \
    --workers 8 \
    --warmup 50 \
    --report-to wandb,tensorboard
```
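
For what it's worth, the standard workaround when a batch does not fit in GPU memory is gradient accumulation: run several small forward/backward passes and call the optimizer once. Below is a toy sketch of the idea; the model, loss, and data are stand-ins, not this repo's training loop:

```python
# Toy gradient-accumulation sketch. The model, loss, and loader are
# stand-ins; in practice these would be the CLIP model, the contrastive
# loss, and the NegCLIP data loader.
import torch
from torch import nn

model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
loader = [(torch.randn(32, 512), torch.randint(0, 512, (32,))) for _ in range(16)]
loss_fn = nn.CrossEntropyLoss()

accum_steps = 8  # 8 micro-batches of 32 approximate one batch of 256

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated grads average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that for a contrastive loss like CLIP's this is not exactly equivalent to a true batch of 256, because in-batch negatives only come from within each micro-batch of 32.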