I'm glad to have read your paper, and I learned a great deal from it.
In the model figure, [CLS] tokens appear as the output of the text encoder. But if I understand the paper correctly, the text encoder is not a PLM like BERT but a plain Transformer encoder. In CLIP's code, the simple tokenizer has only two special tokens, as shown below.
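For reference, here is a minimal sketch of the two special tokens I mean, paraphrased from CLIP's `clip/simple_tokenizer.py` and `clip.tokenize()`; the exact lines may differ, but as far as I can tell neither of them is a [CLS]-style token.

```python
# Sketch of the two special tokens in CLIP's simple tokenizer
# (paraphrased; not an exact copy of the repository code).
from clip.simple_tokenizer import SimpleTokenizer

tokenizer = SimpleTokenizer()
sot_token = tokenizer.encoder["<|startoftext|>"]  # start-of-text
eot_token = tokenizer.encoder["<|endoftext|>"]    # end-of-text ([EOS]-like)

# clip.tokenize() wraps every caption as: <|startoftext|> + BPE tokens + <|endoftext|>
tokens = [sot_token] + tokenizer.encode("a video of a cat") + [eot_token]
print(sot_token, eot_token, tokens)
```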
Also, in the CLIP4Clip paper, the authors use 'the activations from the highest layer of the transformer at the [EOS] token' as the text embedding.
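To make sure we are talking about the same thing: my understanding of "activations at the [EOS] token" is the sketch below, which mirrors CLIP's `model.encode_text()`. This is only my assumption about the pipeline, not your implementation.

```python
# Sketch: extracting the [EOS]-position feature as the text embedding,
# following CLIP's encode_text(); assumes the standard OpenAI CLIP package.
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a video of a cat"]).to(device)  # shape [batch, 77]

with torch.no_grad():
    x = model.token_embedding(text).type(model.dtype)
    x = x + model.positional_embedding.type(model.dtype)
    x = x.permute(1, 0, 2)             # NLD -> LND
    x = model.transformer(x)
    x = x.permute(1, 0, 2)             # LND -> NLD
    x = model.ln_final(x).type(model.dtype)
    # <|endoftext|> has the largest token id, so argmax finds its position;
    # the hidden state there, after projection, is taken as the text embedding.
    eos_feat = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ model.text_projection

print(eos_feat.shape)  # e.g. [1, 512] for ViT-B/32
```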
So I would like to know what your text embedding is, exactly. Since the code hasn't been released yet, I'm asking here. Thanks for reading.