Apologies about the confusion - the variables in the pseudocode do not cleanly align with the module boundaries of the PyTorch implementation; the projection matrix for the ResNet encoders is inside the output projection of the self-attention pooling layer (as answered in #42 and #51). This was just to reuse nn.functional.multi_head_attention_forward
and not write the attention code myself.
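A quick way to see where the two projection matrices live is to load a ResNet checkpoint and inspect the modules. The attribute names below (visual.attnpool, c_proj, text_projection) reflect my reading of model.py and should be double-checked against the version you have installed:

```python
import clip

# Sketch: inspect where W_i and W_t live in a released ResNet checkpoint.
model, _ = clip.load("RN50", device="cpu")

pool = model.visual.attnpool              # AttentionPool2d on top of the ResNet trunk
print(pool.c_proj.weight.shape)           # e.g. [1024, 2048] - the out-projection doubles as W_i
print(model.text_projection.shape)        # e.g. [512, 1024]  - W_t for the text transformer
```

Because c_proj is handed to nn.functional.multi_head_attention_forward as the output projection, the pooled ResNet features come out already projected to the joint embedding size, so there is no separate projection matrix for the ResNet visual encoders.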
In linear probes, we ended up using unnormalized feature vectors before projection, as they gave slightly better performance in downstream tasks overall. Some justification for using normalized features during training is given in #68, and you'd have to use them when using CLIP in a multimodal setting (like zero-shot classification using both encoders), because that's how the image and text feature spaces are connected. On the other hand, the purpose of the linear probes was to measure and compare the representation-learning capability of the image encoder (discarding the text encoder), so the higher dimensionality before projection probably retained more information useful for the probes.
So my suggestion for "the right feature": use the normalized features if you need image and text features living in the same space; if you're using the vision model only, do some cross-validation, probing various points between the operations to pick the one that works best for you.
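For concreteness, here is a minimal sketch of how you might pull out both flavors of image feature from a released ViT checkpoint: the normalized joint-space embedding via encode_image, and the pre-projection activation via a forward hook. The hook target ln_post is the layer that feeds self.proj in the current model.py; adjust the name if the code changes, and note the random tensor is a stand-in for a real preprocessed image:

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Stand-in for preprocess(PIL_image).unsqueeze(0); use real images in practice.
image = torch.randn(1, 3, 224, 224, device=device)

with torch.no_grad():
    # 1) Joint-space features: projected and L2-normalized, comparable with text features.
    img_joint = F.normalize(model.encode_image(image), dim=-1)

    # 2) Pre-projection features (e.g. for linear probing): hook the layer before self.proj.
    captured = {}
    handle = model.visual.ln_post.register_forward_hook(
        lambda mod, inp, out: captured.update(pre_proj=out)
    )
    model.encode_image(image)
    handle.remove()
    img_pre_proj = captured["pre_proj"]   # [1, width], before the W_i projection

print(img_joint.shape, img_pre_proj.shape)
```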
Thanks for the prompt reply! Your explanation makes total sense to me.
This leads to a follow-up question: I assume this means that, within one modality (i.e., image-image or text-text), the features for measuring similarity should also be the normalized versions, right? If so, I feel this would make the "feature arithmetic" property suggested in the CLIP follow-up paper rather constrained, because it would need to hold on a unit sphere. Is this property verified in some way?
Thank you for sharing!
Word arithmetic could still make sense if the nearest neighbor search is based on cosine distance - so the equality here would mean that those vectors are close when projected onto the unit sphere.
Yes, but does that mean the arithmetic should be done after normalizing the features?
I'd try doing so (i.e. normalizing before and after the arithmetic) first, since it'll then be operating like angular arithmetic. This is not an exact science though, and YMMV.
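To make that suggestion concrete, here is a small sketch of arithmetic done in angular space: normalize the operands, combine them, renormalize, and rank candidates by cosine similarity. The random tensors are stand-ins for real CLIP embeddings:

```python
import torch
import torch.nn.functional as F

def angular_arithmetic(a, b, c, candidates):
    """Rank `candidates` by cosine similarity to the unit-sphere combination a - b + c."""
    a, b, c = (F.normalize(v, dim=-1) for v in (a, b, c))   # normalize before...
    query = F.normalize(a - b + c, dim=-1)                  # ...and after the arithmetic
    sims = F.normalize(candidates, dim=-1) @ query          # cosine similarities
    return sims.argsort(descending=True)                    # nearest neighbors first

# Toy usage with random stand-ins for CLIP embeddings (dim 512 for ViT-B/32).
a, b, c = torch.randn(512), torch.randn(512), torch.randn(512)
candidates = torch.randn(1000, 512)
print(angular_arithmetic(a, b, c, candidates)[:5])
```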
Yeah the arithmetic in angular space makes a lot more sense to me. Thank you! I'm closing this for now.
Hi, in the main paper, before computing the logits and cross-entropy loss there are three steps: the encoders extract the raw image and text features, these are projected into the joint embedding space with W_i and W_t, and the projected embeddings are L2-normalized (roughly as sketched below).
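A rough paraphrase of those steps in PyTorch notation, with toy shapes standing in for the real feature dimensions:

```python
import torch
import torch.nn.functional as F

def clip_logits(I_f, T_f, W_i, W_t, logit_scale):
    """Project, L2-normalize, then compute scaled pairwise cosine similarities."""
    I_e = F.normalize(I_f @ W_i, dim=-1)          # project image features, then normalize
    T_e = F.normalize(T_f @ W_t, dim=-1)          # project text features, then normalize
    return logit_scale.exp() * I_e @ T_e.t()      # [n, n] logits for the symmetric CE loss

# Toy shapes: batch 8, image feat dim 2048, text feat dim 512, joint embedding dim 1024.
I_f, T_f = torch.randn(8, 2048), torch.randn(8, 512)
W_i, W_t = torch.randn(2048, 1024), torch.randn(512, 1024)
logits = clip_logits(I_f, T_f, W_i, W_t, torch.tensor(2.659))
```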
I have two questions regarding this:
First of all, in model.py the linear projection seems to be wrapped inside encode_image and encode_text, which contradicts the definition above. So, which tensor should I treat as the right "feature" to use for my own purposes, before the projection or after the projection, and why? Also, why is there no projection in the ResNet visual encoders (I only found self.proj in the ViT variant and self.text_proj in the text encoder)?

Secondly, during training both features are normalized before being used to calculate the CE loss, which means the text "hypernet" is classifying images on the unit sphere. However, during linear probing, the code example in the README suggests that the classifier is fit on the un-normalized features, which means the linear probe is classifying in the unconstrained feature space. Why this discrepancy, and which features should I use for my own downstream tasks, the normalized or the un-normalized version?
Thank you very much! I appreciate any clarification and help.