I'm glad to have read your paper, and I learned a great deal from it.
In the model figure, [CLS] tokens appear as the output of the text encoder. But if I understand the paper correctly, the text encoder is not a PLM like BERT but a plain Transformer encoder. In CLIP's code, the simple tokenizer has only two special tokens, as shown below.
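For reference, here is a minimal sketch of the two special tokens I mean, paraphrased from CLIP's `clip/simple_tokenizer.py` and `clip.tokenize()`; the exact lines may differ, but as far as I can tell neither of them is a [CLS]-style token.

```python
# Sketch of the two special tokens in CLIP's simple tokenizer
# (paraphrased; not an exact copy of the repository code).
from clip.simple_tokenizer import SimpleTokenizer

tokenizer = SimpleTokenizer()
sot_token = tokenizer.encoder["<|startoftext|>"]  # start-of-text
eot_token = tokenizer.encoder["<|endoftext|>"]    # end-of-text ([EOS]-like)

# clip.tokenize() wraps every caption as: <|startoftext|> + BPE tokens + <|endoftext|>
tokens = [sot_token] + tokenizer.encode("a video of a cat") + [eot_token]
print(sot_token, eot_token, tokens)
```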
Also, in the CLIP4Clip paper, the authors use 'the activations from the highest layer of the transformer at the [EOS] token' as the text embedding.
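To make sure we are talking about the same thing: my understanding of "activations at the [EOS] token" is the sketch below, which mirrors CLIP's `model.encode_text()`. This is only my assumption about the pipeline, not your implementation.

```python
# Sketch: extracting the [EOS]-position feature as the text embedding,
# following CLIP's encode_text(); assumes the standard OpenAI CLIP package.
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a video of a cat"]).to(device)  # shape [batch, 77]

with torch.no_grad():
    x = model.token_embedding(text).type(model.dtype)
    x = x + model.positional_embedding.type(model.dtype)
    x = x.permute(1, 0, 2)             # NLD -> LND
    x = model.transformer(x)
    x = x.permute(1, 0, 2)             # LND -> NLD
    x = model.ln_final(x).type(model.dtype)
    # <|endoftext|> has the largest token id, so argmax finds its position;
    # the hidden state there, after projection, is taken as the text embedding.
    eos_feat = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ model.text_projection

print(eos_feat.shape)  # e.g. [1, 512] for ViT-B/32
```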
So I would like to know what your text embedding is, exactly. Since the code hasn't been released yet, I'm asking here. Thanks for reading.