mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Projections W_i and W_t #22

Closed DianeBouchacourt closed 1 year ago

DianeBouchacourt commented 1 year ago

Hi,

I am having trouble understanding the role of the projections W_i and W_t from the original CLIP paper (Fig. 3 here: https://arxiv.org/pdf/2103.00020.pdf).

I dug into the open_clip code (which you built NegCLIP on), and I found the text projector here: https://github.com/mlfoundations/open_clip/blob/d7a5a9595d68287e8ab24797df04d9a79d37faef/src/open_clip/model.py#L228, but no visual projector. Could you point me towards it? Is it the self.proj here: https://github.com/mlfoundations/open_clip/blob/74a72f3a4829656a9cfd8bae02253e2d28ab05d1/src/open_clip/transformer.py#L387?

The VGR task is done with post-projection features, right? And when doing zero-shot, did you use post-projection features? Could we have the zero-shot code? Thanks!
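For context, here is a toy, numpy-only restatement of my reading of the Figure 3 pseudocode, just to make the roles of W_i and W_t concrete; the dimensions and random features are purely illustrative:

```python
import numpy as np

# Illustrative shapes: batch size, image/text feature dims, shared embedding dim.
n, d_i, d_t, d_e = 4, 768, 512, 512
I_f = np.random.randn(n, d_i)    # pre-projection image features (I_f in Fig. 3)
T_f = np.random.randn(n, d_t)    # pre-projection text features (T_f in Fig. 3)
W_i = np.random.randn(d_i, d_e)  # learned image projection
W_t = np.random.randn(d_t, d_e)  # learned text projection
t = np.log(1 / 0.07)             # temperature (log-parameterized)

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Joint multimodal embedding space (post-projection).
I_e = l2_normalize(I_f @ W_i)
T_e = l2_normalize(T_f @ W_t)

# Scaled pairwise cosine similarities, [n, n].
logits = (I_e @ T_e.T) * np.exp(t)
```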

mertyg commented 1 year ago

NegCLIP uses the same wrapper in our code as the original CLIP repo; the wrapper relies on the encode_image and encode_text functions of both models.

Both open_clip and the original CLIP repo rely on the same model design, though (see here) they only expose the text projector explicitly. So the code is equivalent in all cases.

I'm not sure what this means for the pseudocode they provide, though; I believe the original CLIP repo is a better place to open an issue for that part of the question.

mertyg commented 1 year ago

It seems like this one from the original CLIP repo and the second self.proj that you sent could be the answer, but it would be good to clarify with them.

HarmanDotpy commented 1 year ago

self.proj is indeed the visual projector. I think the library design of open_clip follows the original OpenAI CLIP repository, which also had self.proj as the visual projector. I am not sure about VGR, but generally all zero-shot evals are done after projecting (using the text projection for text and self.proj for images) and then normalizing the embeddings.

Edit: if VGR = Visual Genome Relations, then yes, it follows the same standard zero-shot protocol in this repo (embeddings --> projection --> normalization --> cosine similarity); see the sketch below.
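For concreteness, a minimal sketch of that protocol with open_clip's public API (the model name, pretrained tag, tokenizer helper, and image path are illustrative; encode_image / encode_text already apply self.proj and the text projection, so everything here is post-projection):

```python
import torch
import open_clip
from PIL import Image

# Load a model; recent open_clip versions also provide get_tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # hypothetical image path
texts = tokenizer(["the horse is eating the grass",
                   "the grass is eating the horse"])          # VGR-style caption pair

with torch.no_grad():
    image_features = model.encode_image(image)   # post-projection, [1, d_e]
    text_features = model.encode_text(texts)     # post-projection, [2, d_e]
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = image_features @ text_features.T    # cosine similarities, [1, 2]

print(scores)  # the higher-scoring caption is the one the model prefers
```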

HarmanDotpy commented 1 year ago

I think the reason to keep self.proj separate is that only Vision Transformers have a projection layer, whereas ResNets don't have it, IIRC. I am not sure why the ResNets don't have a self.proj.

mertyg commented 1 year ago

Oh, neither VGR nor any other dataset is treated differently in the context of retrieval. Of course, they all use the same protocol; we purposely provided a single endpoint for computing retrieval scores in order to unify evaluations. See our CLIP wrapper here or here.

DianeBouchacourt commented 1 year ago

Thanks to both of you for replying. Yes @HarmanDotpy, I think we understand each other: self.proj is the projection layer of the VisionTransformer. It is just less obvious in the code than the text_projection. So for zero-shot or VGR-type tasks, one uses post-projection features for both text and images, right?

What is interesting is that in the original CLIP paper, when they compare zero-shot with a linear head on top, they use the pre-projection output of the ViT (see the appendix: "For CLIP-ViT models, we used the features before the linear projection to the embedding space, which corresponds to I_f in Figure 3. We train a logistic regression classifier using scikit-learn's L-BFGS implementation, "). Of course, when one wants to compute the cosine similarity between text and images, one uses the common embedding space (post-projection). But I wonder whether, for other downstream tasks (e.g., image-only tasks), one should take pre-projection features or not.
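For image-only experiments, one hacky way to get both views is sketched below. It assumes the ViT visual tower stores the projection as self.proj and skips it when it is None, as in the transformer.py link above; the model name and image path are illustrative, and this is not an official API:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()
image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # hypothetical image path

with torch.no_grad():
    post_proj = model.encode_image(image)   # I_e-style features (shared embedding space)

    proj = model.visual.proj                # W_i, stored as a parameter matrix
    model.visual.proj = None                # the ViT forward skips the matmul when proj is None
    pre_proj = model.encode_image(image)    # I_f-style features (what the CLIP paper probed)
    model.visual.proj = proj                # restore the projection

print(pre_proj.shape, post_proj.shape)      # e.g. [1, 768] vs [1, 512] for ViT-B/32
```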

mertyg commented 1 year ago

Interesting!

  1. In general, calling encode_image / encode_text is what we do for all tasks. Nothing is modified; it is just a call to those functions. So yes, in light of your comments, everything is post-projection.

  2. The original CLIP repo, and also OpenCLIP, suggest using the encode_image function both for zero-shot and linear probing, so that seems to be post-projection as well. I haven't seen anything other than this behavior in most places; I believe most people reporting results with these repos are most likely using post-projection features.
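As an illustration of point 2, a rough linear-probe sketch on post-projection encode_image features (the model, data loaders, and regularization value are placeholders; scikit-learn's LogisticRegression uses L-BFGS by default):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(model, loader, device="cpu"):
    # Collect post-projection image features and labels from a (hypothetical) DataLoader.
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = model.encode_image(images.to(device))   # post-projection features
            feats.append(f.cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# `model`, `train_loader`, and `test_loader` are assumed to be defined elsewhere;
# C=0.316 is just an illustrative regularization strength.
train_x, train_y = extract_features(model, train_loader)
test_x, test_y = extract_features(model, test_loader)
clf = LogisticRegression(max_iter=1000, C=0.316).fit(train_x, train_y)
print("linear-probe accuracy:", clf.score(test_x, test_y))
```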