Open chasehmathis opened 5 months ago
The class (CLS) token is the first entry, so it should actually be image_embeds = enc_image[:, 0, :]
Hi @chasehmathis, that is correct, but our encoder does not have a CLS token. We actually use all tokens to train the CLIP model. We know this is a little redundant, and we will push another model with a better approach in the coming days. If you use a different encoder with a CLS token, you can use the second approach you sent.
On lines 706-731, I believe there is an issue with how the tensors are handled. If we choose to select only the CLS token, then we should not execute this line:
enc_image = enc_image.view(enc_image.shape[0], -1)
and should instead take the last (CLS) token of each tensor:
image_embeds = enc_image[:, -1, :]
I believe that would work better.
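To make the difference between the options in this thread concrete, here is a minimal sketch comparing their output shapes. It assumes the encoder output has shape (batch, num_tokens, hidden). The sizes below are made up for illustration, and NumPy's reshape stands in for the PyTorch view call so the snippet runs without the model. Whether a CLS token exists, and at which position, depends on the encoder, so treat the indices as assumptions.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not taken from the repo).
batch, num_tokens, hidden = 2, 5, 8

# Stand-in for the encoder output: (batch, num_tokens, hidden).
enc_image = np.arange(batch * num_tokens * hidden, dtype=np.float32)
enc_image = enc_image.reshape(batch, num_tokens, hidden)

# Option A: flatten every token into one long vector per image.
# This mirrors enc_image.view(enc_image.shape[0], -1) in PyTorch.
flat_embeds = enc_image.reshape(enc_image.shape[0], -1)

# Option B: keep only the first token (where a CLS token is often placed).
first_token_embeds = enc_image[:, 0, :]

# Option C: keep only the last token (the position proposed in the issue).
last_token_embeds = enc_image[:, -1, :]

print(flat_embeds.shape)         # (batch, num_tokens * hidden)
print(first_token_embeds.shape)  # (batch, hidden)
print(last_token_embeds.shape)   # (batch, hidden)
```

Note that flattening and token selection are mutually exclusive: once the tensor has been flattened to two dimensions, indexing with [:, 0, :] or [:, -1, :] raises an error, which is why the flatten line has to be skipped when selecting a single token.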