Apologies about the confusion - the variables in the pseudocode do not cleanly align with the module boundaries of the PyTorch implementation; the projection matrix for the ResNet encoders is inside the output projection of the self-attention pooling layer (as answered in #42 and #51). This was just to reuse nn.functional.multi_head_attention_forward
and not write the attention code myself.
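A quick way to see where the two projection matrices live is to load a ResNet checkpoint and inspect the modules. The attribute names below (visual.attnpool, c_proj, text_projection) reflect my reading of model.py and should be double-checked against the version you have installed:

```python
import clip

# Sketch: inspect where W_i and W_t live in a released ResNet checkpoint.
model, _ = clip.load("RN50", device="cpu")

pool = model.visual.attnpool              # AttentionPool2d on top of the ResNet trunk
print(pool.c_proj.weight.shape)           # e.g. [1024, 2048] - the out-projection doubles as W_i
print(model.text_projection.shape)        # e.g. [512, 1024]  - W_t for the text transformer
```

Because c_proj is handed to nn.functional.multi_head_attention_forward as the output projection, the pooled ResNet features come out already projected to the joint embedding size, so there is no separate projection matrix for the ResNet visual encoders.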
In linear probes, we ended up using unnormalized feature vectors before projection, as they gave slightly better performance in downstream tasks overall. Some justification for using normalized features during training is given in #68, and you'd have to use them when using CLIP in a multimodal setting (like zero-shot classification using both encoders), because that's how the image and text feature spaces are connected. On the other hand, the purpose of the linear probes was to measure and compare the representation-learning capability of the image encoder (discarding the text encoder), so the higher dimensionality before projection probably retained more information useful for the probes.
So my suggestion for "the right feature": use the normalized features if you need image and text features living in the same space; if you're using the vision model only, do some cross-validation, probing various points between the operations to pick the one that works best for you.
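For concreteness, here is a minimal sketch of how you might pull out both flavors of image feature from a released ViT checkpoint: the normalized joint-space embedding via encode_image, and the pre-projection activation via a forward hook. The hook target ln_post is the layer that feeds self.proj in the current model.py; adjust the name if the code changes, and note the random tensor is a stand-in for a real preprocessed image:

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Stand-in for preprocess(PIL_image).unsqueeze(0); use real images in practice.
image = torch.randn(1, 3, 224, 224, device=device)

with torch.no_grad():
    # 1) Joint-space features: projected and L2-normalized, comparable with text features.
    img_joint = F.normalize(model.encode_image(image), dim=-1)

    # 2) Pre-projection features (e.g. for linear probing): hook the layer before self.proj.
    captured = {}
    handle = model.visual.ln_post.register_forward_hook(
        lambda mod, inp, out: captured.update(pre_proj=out)
    )
    model.encode_image(image)
    handle.remove()
    img_pre_proj = captured["pre_proj"]   # [1, width], before the W_i projection

print(img_joint.shape, img_pre_proj.shape)
```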
Thanks for the prompt reply! Your explanation makes total sense to me.
This leads to a follow-up question: I assume this means that, within one modality (i.e., image-image or text-text), the features for measuring similarity should also be the normalized versions, right? If so, I feel this would make the "feature arithmetic" property suggested in the CLIP follow-up paper rather constrained, because it would need to hold on a unit sphere. Is this property verified in some way?
Thank you for sharing!
Word arithmetic could still make sense if the nearest neighbor search is based on cosine distance - so the equality here would mean that those vectors are close when projected onto the unit sphere.
Yes, but does that mean the arithmetic should be done after normalizing the features?
I'd try doing so (i.e. normalizing before and after the arithmetic) first, since it'll then be operating like angular arithmetic. This is not an exact science though, and YMMV.
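To make that suggestion concrete, here is a small sketch of arithmetic done in angular space: normalize the operands, combine them, renormalize, and rank candidates by cosine similarity. The random tensors are stand-ins for real CLIP embeddings:

```python
import torch
import torch.nn.functional as F

def angular_arithmetic(a, b, c, candidates):
    """Rank `candidates` by cosine similarity to the unit-sphere combination a - b + c."""
    a, b, c = (F.normalize(v, dim=-1) for v in (a, b, c))   # normalize before...
    query = F.normalize(a - b + c, dim=-1)                  # ...and after the arithmetic
    sims = F.normalize(candidates, dim=-1) @ query          # cosine similarities
    return sims.argsort(descending=True)                    # nearest neighbors first

# Toy usage with random stand-ins for CLIP embeddings (dim 512 for ViT-B/32).
a, b, c = torch.randn(512), torch.randn(512), torch.randn(512)
candidates = torch.randn(1000, 512)
print(angular_arithmetic(a, b, c, candidates)[:5])
```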
Yeah the arithmetic in angular space makes a lot more sense to me. Thank you! I'm closing this for now.
Hi, in the main paper, before computing the logits and cross-entropy loss there are three steps: the encoders extract the raw image and text features, these are projected into the joint embedding space with W_i and W_t, and the projected embeddings are L2-normalized (roughly as sketched below).
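A rough paraphrase of those steps in PyTorch notation, with toy shapes standing in for the real feature dimensions:

```python
import torch
import torch.nn.functional as F

def clip_logits(I_f, T_f, W_i, W_t, logit_scale):
    """Project, L2-normalize, then compute scaled pairwise cosine similarities."""
    I_e = F.normalize(I_f @ W_i, dim=-1)          # project image features, then normalize
    T_e = F.normalize(T_f @ W_t, dim=-1)          # project text features, then normalize
    return logit_scale.exp() * I_e @ T_e.t()      # [n, n] logits for the symmetric CE loss

# Toy shapes: batch 8, image feat dim 2048, text feat dim 512, joint embedding dim 1024.
I_f, T_f = torch.randn(8, 2048), torch.randn(8, 512)
W_i, W_t = torch.randn(2048, 1024), torch.randn(512, 1024)
logits = clip_logits(I_f, T_f, W_i, W_t, torch.tensor(2.659))
```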
I have two questions regarding this:
First of all, in model.py the linear projection seems to be wrapped inside encode_image and encode_text, which contradicts the definition above. So, which tensor should I treat as the right "feature" to use for my own purposes, before the projection or after the projection, and why? Also, why is there no projection in the ResNet visual encoders (I only found self.proj in the ViT variant and self.text_proj in the text encoder)?

Secondly, during training both features are normalized before being used to calculate the CE loss, which means the text "hypernet" is classifying images on the unit sphere. However, during linear probing, the code example in the README suggests that the classifier is fit on the un-normalized features, which means the linear probe is classifying in the unconstrained feature space. Why this discrepancy, and which features should I use for my own downstream tasks, the normalized or the un-normalized version?
Thank you very much! I appreciate any clarification and help.