Hi Jialu,
Thanks for your great work. I am curious how you obtain the 1000-dim logits when extracting the CLIP features for the views. The original HAMT/DUET uses a ViT pretrained on ImageNet-1K, which has a prediction head for the 1,000 classes, so it concatenates the 1000-dim logit vector with the 768-dim feature. However, since CLIP has no such prediction head, how do you get the same 1000 logits? I found that they are necessary for the MRC task during pretraining, and they are also present in the CLIP features you provided.
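For reference, here is a shape-level numpy sketch of the composition I mean (not the actual HAMT/DUET extraction code; the arrays are random placeholders for the real ViT outputs):

```python
import numpy as np

# Sketch: an ImageNet-1K-pretrained ViT yields a 768-dim [CLS] embedding
# plus a 1000-dim logit vector from its classification head; HAMT/DUET
# concatenates the two into one per-view feature.

rng = np.random.default_rng(0)

cls_feature = rng.standard_normal(768)  # placeholder for the [CLS] embedding
logits = rng.standard_normal(1000)      # placeholder for the prediction-head logits

view_feature = np.concatenate([cls_feature, logits])
print(view_feature.shape)  # (1768,)
```

Since CLIP's image encoder only yields the embedding part, it is unclear to me where the second piece should come from.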
Best regards