mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Similarity scores for NegCLIP are pretty similar #26

Closed kochsebastian closed 1 year ago

kochsebastian commented 1 year ago

When testing NegCLIP, I was able to replicate the results from the paper.

However, I noticed that the similarity scores before computing the argmax are quite similar for the positive and negative classes. This becomes especially obvious when testing NegCLIP with more than two classes.

With vanilla CLIP using softmax, we can obtain a zero-shot classification probability across multiple classes. It is not uncommon to see a probability distribution like [65%, 21%, 10%, 4%].
However, in my experiments using NegCLIP, the output probability distribution for four classes looks more like [26%, 25%, 25%, 24%].
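The effect described above can be illustrated with a small sketch (the similarity values below are hypothetical, chosen only for illustration): CLIP-style zero-shot classification applies a softmax over cosine similarities multiplied by the model's logit scale (roughly 100 for CLIP). When the raw similarities for the class prompts sit close together, the resulting probability distribution is nearly uniform, which matches the flat NegCLIP outputs reported here.

```python
import math

def zero_shot_probs(similarities, logit_scale=100.0):
    """Softmax over cosine similarities scaled by a CLIP-style logit scale."""
    logits = [logit_scale * s for s in similarities]
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for four class prompts:
spread_out = [0.30, 0.28, 0.27, 0.25]        # well-separated scores -> peaked softmax
bunched_up = [0.261, 0.260, 0.259, 0.258]    # nearly identical scores -> flat softmax

print(zero_shot_probs(spread_out))   # top class dominates (~84%)
print(zero_shot_probs(bunched_up))   # roughly [29%, 26%, 24%, 21%]
```

A similarity gap of 0.02 becomes a logit gap of 2 after scaling, which the softmax turns into a large probability ratio; a gap of 0.001 barely moves the distribution, even though the argmax (and hence the benchmark accuracy) is unchanged.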

Although NegCLIP ranks the correct class top-1 more often than vanilla CLIP, its output probabilities are therefore much less interpretable.

CLIP has been shown to be useful for distilling knowledge into other domains, and I am wondering whether the same is possible with NegCLIP's output given these near-uniform scores.

Has anyone else observed this issue? @authors, is this output intended, or am I doing something wrong?