openai / CLIP

CLIP (Contrastive Language-Image Pretraining): Predict the most relevant text snippet given an image

On a Bag-of-Words baseline and a transformer #436

Open iburenko opened 7 months ago

iburenko commented 7 months ago

Dear authors,

Thank you very much for the great work!

I am trying to understand how one obtains the bag-of-words representation of a caption that is described in Sec. 2.3:

we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.
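For context, I understand the contrastive objective itself as the symmetric cross-entropy over cosine similarities sketched in Figure 3 of the paper. A minimal PyTorch sketch of that loss (function and tensor names are my own, not from this repo):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Normalise both modalities so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities for a batch of N matched (image, text) pairs.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The i-th image is paired with the i-th text.
    labels = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross-entropy over rows (images) and columns (texts).
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```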

I wonder how this bag-of-words baseline is trained with the transformer. My guess is that one could drop the positional embeddings during training (and only use them at inference), so that the activation of the transformer's last layer at the [EOS] token becomes insensitive to word order and can therefore be interpreted as a BoW embedding. Is this what is happening, or are these BoW representations computed differently?
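Concretely, here is a minimal sketch of the kind of encoder I have in mind: a CLIP-style text transformer whose positional embeddings can be switched off, so that (with bidirectional attention and fixed padding) the [EOS] activation depends only on which tokens occur, not on their order. All names, sizes, and the `use_positional` flag are my own assumptions, not code from this repo:

```python
import torch
import torch.nn as nn

class BoWTextEncoder(nn.Module):
    def __init__(self, vocab_size=49408, width=512, layers=6, heads=8,
                 context_length=77, use_positional=False):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        self.positional_embedding = nn.Parameter(torch.zeros(context_length, width))
        self.use_positional = use_positional
        encoder_layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.ln_final = nn.LayerNorm(width)

    def forward(self, tokens):
        x = self.token_embedding(tokens)                        # [batch, seq_len, width]
        if self.use_positional:
            # Skipping this addition makes the encoder order-invariant ("BoW").
            x = x + self.positional_embedding[: tokens.size(1)]
        x = self.transformer(x)
        x = self.ln_final(x)
        # Pool at the [EOS] token, which has the highest token id in CLIP's tokenizer.
        eos_positions = tokens.argmax(dim=-1)
        return x[torch.arange(x.size(0)), eos_positions]        # [batch, width]
```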