Open · jhwang7628 opened this issue 2 years ago
Hi @jhwang7628, thanks for your interest.
Hi @dxli94,
Thanks for getting back to me.
I have been doing some testing, and the full [CLS] feature (`features.image_embeds[:, 0, :]`) does not perform on par with the projected feature (`features.image_embeds_proj[:, 0, :]`). Using it for the video retrieval task on MSRVTT, the performance is 0.2% compared to 31%. That makes me wonder whether I am doing something wrong. Any guess?
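For context, a minimal sketch of the kind of retrieval evaluation described above, assuming one pooled feature vector per video (e.g. the [CLS] slice of `image_embeds_proj` averaged over sampled frames) and one per caption; the helper name `recall_at_1` and the use of `text_embeds_proj` for the queries are assumptions, not something stated in this thread:

```python
import torch
import torch.nn.functional as F

def recall_at_1(video_feats: torch.Tensor, text_feats: torch.Tensor) -> float:
    """Text-to-video Recall@1, assuming row i of each tensor is a matched pair.

    video_feats: (N, D) pooled visual features, e.g. image_embeds_proj[:, 0, :]
                 averaged over the sampled frames of each video.
    text_feats:  (N, D) query features, e.g. text_embeds_proj[:, 0, :].
    """
    # Normalize both sides so the dot product is a cosine similarity.
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim = text_feats @ video_feats.t()                        # (N, N) similarity matrix
    top1 = sim.argmax(dim=-1)                                 # best-matching video per query
    targets = torch.arange(len(text_feats), device=sim.device)  # query i matches video i
    return (top1 == targets).float().mean().item()
```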
Hi @jhwang7628,
The projected features are normalized and are used to compute the contrastive loss, so they are well suited to retrieval.
The [CLS] feature is not normalized, so poor results are expected if you use it to compute similarities directly.
Thanks.
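To make the normalization point concrete, here is a minimal sketch with dummy tensors standing in for the extractor outputs (the 768/256 dimensions are assumptions based on the ViT-B configuration):

```python
import torch
import torch.nn.functional as F

# Stand-ins for features.image_embeds[:, 0, :] (raw [CLS]) and
# features.image_embeds_proj[:, 0, :] (projected); dims 768/256 assumed.
cls_feat = torch.randn(8, 768)                        # not normalized
proj_feat = F.normalize(torch.randn(8, 256), dim=-1)  # unit-norm rows

# Dot products of unnormalized vectors mix direction with magnitude,
# so the ranking they induce is not a cosine-similarity ranking.
raw_sim = cls_feat @ cls_feat.t()

# L2-normalize first if you want cosine similarities from the raw [CLS].
cos_sim = F.normalize(cls_feat, dim=-1) @ F.normalize(cls_feat, dim=-1).t()

# The projected features are already unit-norm, so a plain dot product
# is already a cosine similarity.
proj_sim = proj_feat @ proj_feat.t()
```

Even after normalizing, the raw [CLS] feature is not the vector the contrastive loss was computed on, so the projected features remain the natural choice for retrieval.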
Thanks for the great work. I have some questions about the BLIP feature extractor interface.

In the example code you wrote, what are the other channels (`[:, 1:12, :]`) useful for?

In the example code of the API, there is another attribute called `image_features` (link), but it is not available. Can you comment on the difference between `image_embeds` and `image_features`, and how to access the latter?

Thanks!
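For reference, a minimal sketch of the feature-extractor interface being discussed, based on the documented LAVIS usage; the image path, the caption string, and the shape comments are illustrative assumptions:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the BLIP feature extractor together with its matching preprocessors.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")   # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of something")  # placeholder caption
sample = {"image": image, "text_input": [text]}

features = model.extract_features(sample, mode="image")

# Index 0 along the token dimension is the [CLS] token (a sequence-level
# summary); the remaining indices are per-patch embeddings for an image
# input, or per-word-token embeddings when mode="text".
print(features.image_embeds.shape)       # roughly (1, 197, 768) for ViT-B at 224x224
print(features.image_embeds_proj.shape)  # roughly (1, 197, 256), normalized projection
cls_raw = features.image_embeds[:, 0, :]
cls_proj = features.image_embeds_proj[:, 0, :]
```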