mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023

Evaluation bug when using GELU vs QuickGELU -- changes the results for some benchmarks #35

Open bryant1410 opened 7 months ago

bryant1410 commented 7 months ago

Hey, I believe there's a bug in the evaluation. The NegCLIP code is based on open_clip, which supports both the original OpenAI ViT-B/32 architecture, which uses QuickGELU (they name it ViT-B-32-quickgelu), and their "standard" variant (ViT-B-32), which uses plain GELU.

When you use their code, you specify a model (config) name and a pretrained checkpoint, where the checkpoint is either a name supported for that model or a path to a file. The "openai" checkpoint is supported for both ViT-B-32 and ViT-B-32-quickgelu because open_clip hardcodes that pretrained name to switch to a QuickGELU implementation, regardless of which of the two model names is used.

The problem is that ViT-B-32 also seems to have been used for evaluation (by specifying a path as pretrained instead of "openai"). However, this makes the model use GELU rather than QuickGELU, because the hardcoded code path isn't triggered, and this affects the results. This is error-prone behavior from open_clip, in my humble opinion. The fix would be to use ViT-B-32-quickgelu in the evaluation or to pass the flag --force-quick-gelu.
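
For reference, here is a minimal sketch of the difference using open_clip's `create_model_and_transforms`; the checkpoint path is a placeholder, and keyword availability may vary across open_clip versions:

```python
import open_clip

# Loading the generic ViT-B-32 config with a checkpoint *path* uses nn.GELU,
# because open_clip's QuickGELU override only fires for the "openai" pretrained tag.
model_gelu, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/negclip.pth"  # placeholder checkpoint path
)

# Either of these restores QuickGELU, matching how the checkpoint was trained:
model_qgelu, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="/path/to/negclip.pth"
)
model_forced, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/negclip.pth", force_quick_gelu=True
)
```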

How do I know that you ran it this way for evaluation (i.e., that you hit this bug)? Because when I use GELU I can reproduce your numbers from Table 6, but when I use QuickGELU I get different numbers. I reproduced the numbers with a fork of open_clip and my own evaluation of ARO, using the checkpoint you shared.

Numbers for NegCLIP:

| | VG-Attribution | VG-Relation | Flickr30k-Order | COCO-Order |
| --- | --- | --- | --- | --- |
| Reported results from Table 6 | 71.0 | 81.0 | 91.0 | 86.0 |
| My evaluation using ViT-B-32 | 70.5 | 80.1 | 90.1 | 87.0 |
| My evaluation using ViT-B-32-quickgelu | 72.0 | 81.8 | 86.4 | 82.7 |

On top of this, the zero-shot results on other benchmarks significantly improve after fixing this bug:

| | CIFAR10 | CIFAR100 | ImageNet |
| --- | --- | --- | --- |
| My evaluation using ViT-B-32 | 85.9 | 60.9 | 55.8 |
| My evaluation using ViT-B-32-quickgelu | 88.9 | 62.9 | 61.0 |
| My evaluation for CLIP | 89.8 | 64.2 | 63.4 |

BTW, these results also match those reproduced by someone who commented in OpenReview and by #29.

As we can see, once this bug is fixed, the numbers are much closer to those of the original pre-trained CLIP.


For the sake of completeness (and to see that my evaluation is consistent), I also reproduced the ARO results for the OpenAI-pretrained CLIP ViT-B/32:

| | VG-Attribution | VG-Relation | Flickr30k-Order | COCO-Order |
| --- | --- | --- | --- | --- |
| Reported results from Table 6 for CLIP | 62.0 | 59.0 | 59.0 | 46.0 |
| My evaluation for CLIP | 63.4 | 59.9 | 59.2 | 47.7 |
ytaek-oh commented 4 months ago

From the author's NegCLIP fork, I found that the NegCLIP model was fine-tuned from the openai ViT-B-32 model, where QuickGELU is applied by default. So, in my opinion, to maintain consistency between training and evaluation, the NegCLIP model in this repo needs to be initialized with quick_gelu=True during evaluation, which is not the case by default because of open_clip's behavior.
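
As a sanity check, here is a rough sketch of how one could confirm which activation the loaded model actually uses; the attribute path follows open_clip's ViT implementation and may differ across versions, and the checkpoint path is a placeholder:

```python
import open_clip

# Load the NegCLIP checkpoint both ways and print the activation module in use.
for force in (False, True):
    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="/path/to/negclip.pth", force_quick_gelu=force
    )
    act = model.visual.transformer.resblocks[0].mlp.gelu  # activation of the first ViT block
    print(f"force_quick_gelu={force}: activation = {type(act).__name__}")
    # expected: GELU when force is False, QuickGELU when force is True
```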

I also include my own evaluation results for NegCLIP:

| Model | quick_gelu | VG_rel | VG_attr | Flickr30k_Order | COCO_Order | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| NegCLIP-ViT-B-32 | False | 80.2 | 70.5 | 91.0 | 86.8 | 82.1 |
| NegCLIP-ViT-B-32 | True | 81.8 | 72.1 | 87.1 | 82.6 | 80.9 |

In addition, aligned with your observation, setting quick_gelu=True is especially critical for zero-shot recognition performance, as my own results below show.

| Model | quick_gelu | ImageNet | CIFAR10 | CIFAR100 |
| --- | --- | --- | --- | --- |
| NegCLIP-ViT-B-32 | False | 55.7 | 85.9 | 61.2 |
| NegCLIP-ViT-B-32 | True | 60.9 | 88.9 | 63.2 |