Hi Pengfei,
I have noticed that the baseline CLIP results in Table 1 of the paper are much higher than what I have been able to reproduce, even though I use the exact same dataset from this repo. I could not find details of how this CLIP baseline is implemented in your paper. Do you simply concatenate the multimodal CLIP features and compute the similarity, or do you add another small model such as an MLP after the concatenation and train it to get these results? Could you provide more details on this? Thank you so much!
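To make the question concrete, here is a minimal sketch of what I mean by the first option (concatenating the per-modality features and scoring by cosine similarity). The feature extraction is stubbed out with random normalized vectors standing in for frozen CLIP image/text embeddings; the names and dimensions here are my assumptions, not your actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_clip_features(n, dim=512):
    # Placeholder for CLIP encoder outputs (assumed L2-normalized);
    # in my actual run these come from the frozen image/text encoders.
    x = rng.standard_normal((n, dim)).astype(np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Two sets of multimodal samples, each with an image and a text feature.
img_a, txt_a = fake_clip_features(4), fake_clip_features(4)
img_b, txt_b = fake_clip_features(4), fake_clip_features(4)

# Option 1: concatenate the modalities, renormalize, then cosine similarity.
a = np.concatenate([img_a, txt_a], axis=1)
b = np.concatenate([img_b, txt_b], axis=1)
a = a / np.linalg.norm(a, axis=1, keepdims=True)
b = b / np.linalg.norm(b, axis=1, keepdims=True)
sim = a @ b.T  # (4, 4) pairwise cosine similarities in [-1, 1]
print(sim.shape)
```

This is the training-free variant I tried; the second option would add a trainable head (e.g. an MLP) on top of the concatenated features.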