mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023

Evaluation bug when using GELU vs QuickGELU -- changes the results for some benchmarks #35

Open bryant1410 opened 7 months ago

bryant1410 commented 7 months ago

Hey, I believe there's a bug in the evaluation. The NegCLIP code is based on open_clip, which supports both the original OpenAI ViT-B/32 architecture, which uses QuickGELU (they name it ViT-B-32-quickgelu), and their "standard" variant (ViT-B-32), which uses plain GELU.

When you use their code, you specify a model (config) name and a pretrained checkpoint, where the checkpoint is either a name supported for that model or a path to a file. The "openai" checkpoint is supported for both ViT-B-32 and ViT-B-32-quickgelu because open_clip hardcodes that pretrained name to switch to a QuickGELU implementation, regardless of which of the two model names is used.

The problem is that ViT-B-32 also seems to have been used for evaluation (by specifying a path as pretrained instead of "openai"). However, this makes the model use GELU rather than QuickGELU, because the hardcoded code path isn't triggered, and this affects the results. This is error-prone behavior from open_clip, in my humble opinion. The fix would be to use ViT-B-32-quickgelu in the evaluation or to pass the flag --force-quick-gelu.
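
For reference, here is a minimal sketch of the difference using open_clip's `create_model_and_transforms`; the checkpoint path is a placeholder, and keyword availability may vary across open_clip versions:

```python
import open_clip

# Loading the generic ViT-B-32 config with a checkpoint *path* uses nn.GELU,
# because open_clip's QuickGELU override only fires for the "openai" pretrained tag.
model_gelu, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/negclip.pth"  # placeholder checkpoint path
)

# Either of these restores QuickGELU, matching how the checkpoint was trained:
model_qgelu, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="/path/to/negclip.pth"
)
model_forced, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/negclip.pth", force_quick_gelu=True
)
```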

How do I know that you ran it this way for evaluation (i.e., that you hit this bug)? Because when I use GELU I can reproduce your numbers from Table 6, but when I use QuickGELU I get different numbers. I reproduced the numbers with a fork of open_clip and my own evaluation of ARO, using the checkpoint you shared.

Numbers for NegCLIP:

| | VG-Attribution | VG-Relation | Flickr30k-Order | COCO-Order |
| --- | --- | --- | --- | --- |
| Reported results from Table 6 | 71.0 | 81.0 | 91.0 | 86.0 |
| My evaluation using ViT-B-32 | 70.5 | 80.1 | 90.1 | 87.0 |
| My evaluation using ViT-B-32-quickgelu | 72.0 | 81.8 | 86.4 | 82.7 |

On top of this, the zero-shot results on other benchmarks significantly improve after fixing this bug:

| | CIFAR10 | CIFAR100 | ImageNet |
| --- | --- | --- | --- |
| My evaluation using ViT-B-32 | 85.9 | 60.9 | 55.8 |
| My evaluation using ViT-B-32-quickgelu | 88.9 | 62.9 | 61.0 |
| My evaluation for CLIP | 89.8 | 64.2 | 63.4 |

BTW, these results also match those reproduced by someone who commented in OpenReview and by #29.

As we can see, once this bug is fixed, the numbers are much closer to those of the original pre-trained CLIP.


For the sake of completeness (and to see that my evaluation is consistent), I also reproduced the ARO results for the OpenAI-pretrained CLIP ViT-B/32:

| | VG-Attribution | VG-Relation | Flickr30k-Order | COCO-Order |
| --- | --- | --- | --- | --- |
| Reported results from Table 6 for CLIP | 62.0 | 59.0 | 59.0 | 46.0 |
| My evaluation for CLIP | 63.4 | 59.9 | 59.2 | 47.7 |
ytaek-oh commented 4 months ago

From the author's NegCLIP fork, I found that the NegCLIP model was fine-tuned from the openai ViT-B-32 model, where QuickGELU is applied by default. So, in my opinion, to maintain consistency between training and evaluation, the NegCLIP model in this repo needs to be initialized with quick_gelu=True during evaluation, which is not the case by default because of open_clip's behavior.
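
As a sanity check, here is a rough sketch of how one could confirm which activation the loaded model actually uses; the attribute path follows open_clip's ViT implementation and may differ across versions, and the checkpoint path is a placeholder:

```python
import open_clip

# Load the NegCLIP checkpoint both ways and print the activation module in use.
for force in (False, True):
    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="/path/to/negclip.pth", force_quick_gelu=force
    )
    act = model.visual.transformer.resblocks[0].mlp.gelu  # activation of the first ViT block
    print(f"force_quick_gelu={force}: activation = {type(act).__name__}")
    # expected: GELU when force is False, QuickGELU when force is True
```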

I also include my own evaluation results for NegCLIP:

| Model | quick_gelu | VG_rel | VG_attr | Flickr30k_Order | COCO_Order | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| NegCLIP-ViT-B-32 | False | 80.2 | 70.5 | 91.0 | 86.8 | 82.1 |
| NegCLIP-ViT-B-32 | True | 81.8 | 72.1 | 87.1 | 82.6 | 80.9 |

In addition, aligned with your observation, setting quick_gelu=True is especially critical for zero-shot recognition performance, as my own results below show.

| Model | quick_gelu | ImageNet | CIFAR10 | CIFAR100 |
| --- | --- | --- | --- | --- |
| NegCLIP-ViT-B-32 | False | 55.7 | 85.9 | 61.2 |
| NegCLIP-ViT-B-32 | True | 60.9 | 88.9 | 63.2 |