NegCLIP uses the same wrapper in our code as the original CLIP repo, which relies on the encode_image and encode_text functionalities of both models. Both open_clip and the original CLIP repo rely on the same model design, though (see here) only the text_projector is defined explicitly at that level. So the code is equivalent in all cases.
I'm not sure what this means for the pseudocode they provide, though; I believe the original CLIP repo is a better place to open an issue for that part of the question.
It seems like this one from the original CLIP repo and the second self.proj that you sent could be the answer, but it would be good to clarify with them.
self.proj is indeed the visual projector. I think the library design of open_clip follows the original OpenAI CLIP repository, which also had self.proj as the visual projector. I am not sure about VGR, but in general all zero-shot evals are done after projecting (using the text projection for text and self.proj for images) and then normalizing the embeddings.
edit: if VGR = Visual Genome Relations, then yes, it follows the same standard zero-shot protocol in this repo (embeddings --> projection --> normalization --> cosine similarity), sketched below.
I think the reason to keep self.proj separate is that only vision transformers have a projection layer, whereas ResNets don't have one, IIRC. I am not sure why ResNets don't have the self.proj.
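For concreteness, here is a minimal sketch of that protocol using open_clip's public API. The model name, pretrained tag, and image path are placeholders, and the exact create_model_and_transforms / get_tokenizer signatures may differ across open_clip versions:

```python
import torch
import open_clip
from PIL import Image

# Placeholders: pick whatever architecture/checkpoint you actually evaluate.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # "example.jpg" is a placeholder path
texts = tokenizer(["a photo of a dog", "a photo of a cat"])

with torch.no_grad():
    # encode_image / encode_text already apply self.proj / text_projection,
    # so these are post-projection embeddings in the shared space.
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # normalization
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = img_feat @ txt_feat.T                               # cosine similarity
    probs = (model.logit_scale.exp() * sims).softmax(dim=-1)   # optional: caption probabilities
```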
Thanks both for replying. Yes @HarmanDotpy, I think we understand each other: self.proj is the projection layer of the VisionTransformer. It was just less explicit in the code than the text_projector. So for zero-shot or VGR-type tasks, one uses post-projection features for both text and image, right?
What is interesting is that in the original CLIP paper, when they compare zero-shot with a linear head on top, they use the pre-projection output of the ViT (see the appendix: "For CLIP-ViT models, we used the features before the linear projection to the embedding space, which corresponds to I_f in Figure 3. We train a logistic regression classifier using scikit-learn's L-BFGS implementation, [...]"). Of course, when one wants to compute the cosine similarity of text and images, we take the common embedding space (post-projection). But I wonder whether for other downstream tasks (e.g. ones only on images) one would take the pre-projection features or not.
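In case it is useful, here is a hedged sketch of one way to grab those pre-projection features (I_f) from an OpenAI-style / open_clip VisionTransformer, assuming ln_post is applied to the pooled token right before the `x @ self.proj` step (as in the versions linked below; other versions may structure this differently). `model` and `image` are assumed to come from a sketch like the one above:

```python
import torch

features = {}

def save_pre_proj(module, inputs, output):
    # Output of visual.ln_post = pooled ViT feature *before* the linear projection,
    # i.e. the I_f that the CLIP paper uses for its logistic-regression probes.
    features["pre_proj"] = output.detach()

hook = model.visual.ln_post.register_forward_hook(save_pre_proj)
with torch.no_grad():
    post_proj = model.encode_image(image)  # post-projection, shared embedding space
hook.remove()

pre_proj = features["pre_proj"]            # pre-projection, what the paper probes
```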
Interesting!
In general, using encode_image / encode_text is what we do for all tasks. Nothing is modified, just a call to those functions. Thus yes, in light of your comments, everything is post-projection.
In the original CLIP repo, and also in OpenCLIP, they suggest using the encode_image function both for zero-shot and for linear probing, so that also seems post-projection. I haven't seen anything other than this behavior in most places; I believe most people reporting results with their repo are most likely using post-projection.
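As an illustration, a post-projection linear probe in that style might look roughly like this (a sketch only; `train_loader` is a placeholder DataLoader, and the regularization strength would normally be swept on a validation split):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

feats, labels = [], []
with torch.no_grad():
    for images, targets in train_loader:   # train_loader is a placeholder
        f = model.encode_image(images)     # post-projection features via encode_image
        feats.append(f.cpu().numpy())
        labels.append(targets.numpy())

X, y = np.concatenate(feats), np.concatenate(labels)

# scikit-learn's L-BFGS logistic regression, as in the CLIP paper's probing setup.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```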
Hi,
I am having trouble understanding the role of the projections W_i and W_t in the original CLIP paper (Fig. 3 here: https://arxiv.org/pdf/2103.00020.pdf).
I dug into the open_clip code (which you built NegCLIP on), and I found the text projector here: https://github.com/mlfoundations/open_clip/blob/d7a5a9595d68287e8ab24797df04d9a79d37faef/src/open_clip/model.py#L228, but no visual projector. Could you point me towards it? Is it the self.proj here: https://github.com/mlfoundations/open_clip/blob/74a72f3a4829656a9cfd8bae02253e2d28ab05d1/src/open_clip/transformer.py#L387? Is the VGR task done with post-projection features? When doing zero-shot, did you use post-projection features? Could we have the zero-shot code? Thanks!