mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023

A question on results from Figure 2 and bag-of-wordness #38

Open · iburenko opened this issue 4 months ago

iburenko commented 4 months ago

Dear authors,

Thank you very much for the great work!

In Intriguing Properties of Vision Transformers (Sec. 3.3), the authors show that positional encoding is not crucial for ViT, which makes it possible to train a ViT on permuted patches. It therefore seems reasonable that retrieval performance degrades after perturbing the vision tokens, yet does not become completely random.
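
For concreteness, this is roughly the kind of patch permutation I have in mind; it is just a sketch in plain PyTorch (the patch size of 32 matches ViT-B/32, the function name is my own), not your actual evaluation code:

```python
import torch

def shuffle_patches(images: torch.Tensor, patch_size: int = 32) -> torch.Tensor:
    """Randomly permute non-overlapping patches of a batch of images.

    images: (B, C, H, W) with H and W divisible by patch_size.
    """
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    # Split into patches: (B, C, H, W) -> (B, ph*pw, C, patch, patch)
    patches = (
        images.reshape(b, c, ph, patch_size, pw, patch_size)
        .permute(0, 2, 4, 1, 3, 5)
        .reshape(b, ph * pw, c, patch_size, patch_size)
    )
    # Apply the same random permutation of patch positions to the whole batch.
    perm = torch.randperm(ph * pw)
    patches = patches[:, perm]
    # Reassemble the images from the permuted patches.
    return (
        patches.reshape(b, ph, pw, c, patch_size, patch_size)
        .permute(0, 3, 1, 4, 2, 5)
        .reshape(b, c, h, w)
    )
```

Feeding the output of this function to the image encoder instead of the original images is how I would probe the "perturbed vision tokens" setting.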

On the other hand, if I understand it correctly, in the CLIP paper the authors train their model using bag-of-words representations (Sec. 2.3):

[...] we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.

Although the authors of CLIP claim that they train their model on bag-of-words embeddings (even though they have never shared their training code, if I am not mistaken), it seems that they use the "usual" contextual representations at inference time. Given that, it is also not very surprising that the retrieval results without word order do not match those where word order is preserved, yet also do not degrade completely.
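
As an illustration of the text-side perturbation I am talking about, one could compare the image-caption similarity before and after shuffling the caption words, e.g. with the openai/CLIP package (the checkpoint name, caption, and image path below are only placeholders):

```python
import random

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical caption and image, just for illustration.
caption = "a black cat sitting on a wooden chair"
words = caption.split()
random.shuffle(words)
shuffled = " ".join(words)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
tokens = clip.tokenize([caption, shuffled]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(tokens)
    # Cosine similarity between the image and each caption variant.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)

print(f"original: {sims[0].item():.3f}  shuffled: {sims[1].item():.3f}")
```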

It is quite possible that I am simply missing something, but it is not clear to me why we should expect behaviour different from that of a bag-of-words model for either modality on retrieval tasks. The text backbone appears to have been trained on BoW representations, while the ViT is robust to patch permutations, which makes it resemble a vision BoW model. With that in mind, and referring to what you write in Sec. 3.2:

We thus argue that it is unclear what should incentivize models trained with a contrastive loss to learn to pay attention to order structure [...]

I would say it is the combination of CLIP's training strategy and ViT's inductive bias that leads the model to ignore order structure.
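
To make this point about the objective concrete: the contrastive loss only compares one pooled embedding per image with one pooled embedding per caption within the batch, so nothing in the loss itself references token or patch order; any order sensitivity would have to come from the encoders. A minimal sketch of a CLIP-style symmetric contrastive loss (following the pseudocode in the CLIP paper, with my own naming):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over pooled (image, text) embeddings.

    image_emb, text_emb: (B, D); row i of each tensor is a matching pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)      # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```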

I would be very grateful if you could explain what I have got wrong here. Thank you very much, and I look forward to your answer!