On a Bag-of-Words baseline and a transformer

Dear authors,

Thank you very much for the great work!

I am trying to understand how one could obtain the bag-of-words representations of a caption that are described in Sec. 2.3:

"we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet."

I wonder how this bag-of-words baseline is trained with the transformer. My guess is that the positional embeddings could be omitted during training (obviously, we use them during inference), which would make the last-layer activations at the [EOS] token context-free and therefore interpretable as BoW embeddings. Is this what is happening, or are the BoW representations computed in some other way?
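
To make the guess concrete, here is a toy sketch in plain Python. It is not taken from the CLIP code base and is not meant to describe what the authors did. The `encode` function, the random embedding tables, and the example captions are all hypothetical, and a simple tanh-then-mean pooling stands in for the transformer, so self-attention and the masked attention of CLIP's text encoder are not modelled. It only illustrates the property I have in mind: without positional embeddings the pooled representation does not depend on word order, while with them it does.

```python
import math
import random


def random_vectors(keys, dim, rng):
    """Hypothetical embedding table: one random vector per key."""
    return {k: [rng.gauss(0.0, 1.0) for _ in range(dim)] for k in keys}


def encode(tokens, tok_emb, pos_emb=None):
    """Toy caption encoder: mean over tokens of tanh(token [+ position]).

    With pos_emb=None every token is processed identically regardless of
    where it occurs, so the result behaves like a bag-of-words embedding.
    """
    dim = len(tok_emb[tokens[0]])
    pooled = [0.0] * dim
    for pos, tok in enumerate(tokens):
        for i in range(dim):
            x = tok_emb[tok][i]
            if pos_emb is not None:
                x += pos_emb[pos][i]
            pooled[i] += math.tanh(x)
    return [p / len(tokens) for p in pooled]


def max_abs_diff(u, v):
    """Largest coordinate-wise difference between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


rng = random.Random(0)
caption = ["a", "dog", "chasing", "a", "cat"]
shuffled = ["a", "cat", "chasing", "a", "dog"]  # same words, other order

tok_emb = random_vectors(sorted(set(caption)), 8, rng)
pos_emb = random_vectors(range(len(caption)), 8, rng)

# Without positions: identical up to floating-point rounding.
bow_gap = max_abs_diff(encode(caption, tok_emb), encode(shuffled, tok_emb))

# With positions: the two word orders give different embeddings.
pos_gap = max_abs_diff(
    encode(caption, tok_emb, pos_emb), encode(shuffled, tok_emb, pos_emb)
)

print("gap without positional embeddings:", bow_gap)
print("gap with positional embeddings:   ", pos_gap)
```

In this toy setting the two captions collapse to the same vector once the position table is dropped, which is the sense in which I would call the resulting embedding a BoW representation.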
