salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

Blip feature extractor API #39

Open jhwang7628 opened 2 years ago

jhwang7628 commented 2 years ago

Thanks for the great work. I have some questions about the BLIP feature extractor interface.

  1. In the example code, you wrote

    # torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks

    What are the other channels [:, 1:12, :] useful for?

  2. In the API's example code, there is another attribute called image_features (link), but it is not available. Can you comment on the difference between image_embeds and image_features, and how to access the latter?

    print(features_image.image_embeds.shape)
    print(features_image.image_features.shape)

Thanks!

dxli94 commented 2 years ago

Hi @jhwang7628 , thanks for your interest.

  1. The features at index 0 are those of the [CLS] token. The rest correspond to the other token positions. Although we use [CLS] by default, you may also use the others for your application, e.g. by taking their mean (see the sketch after this list).
  2. It turns out the naming in the in-line example code is deprecated. Please refer to the README example for now; we will update the in-line naming accordingly. Regarding your question: the embeds are obtained directly from the feature networks, while embeds_proj are normalized embeddings obtained by further projecting the embeddings into a low-dimensional space. The projected embeddings are useful for computing feature similarities in a normalized space.
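
For reference, a minimal sketch along the lines of the current README-style interface (the image path and caption are placeholders; the printed shapes are indicative, not guaranteed):

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # placeholder inputs -- substitute your own image and caption
    raw_image = Image.open("merlion.png").convert("RGB")
    caption = "a large fountain spewing water into the air"

    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip_feature_extractor", model_type="base", is_eval=True, device=device
    )
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text_input = txt_processors["eval"](caption)
    sample = {"image": image, "text_input": [text_input]}

    # multimodal features: index 0 is the [CLS] position, the rest are token positions
    features_multimodal = model.extract_features(sample)
    cls_feat = features_multimodal.multimodal_embeds[:, 0, :]      # [CLS] feature
    mean_feat = features_multimodal.multimodal_embeds.mean(dim=1)  # mean over all positions

    # unimodal features: *_embeds come straight from the encoders,
    # *_embeds_proj are projected to a low-dimensional space and normalized
    features_image = model.extract_features(sample, mode="image")
    features_text = model.extract_features(sample, mode="text")
    print(features_image.image_embeds.shape)       # e.g. torch.Size([1, 197, 768])
    print(features_image.image_embeds_proj.shape)  # e.g. torch.Size([1, 197, 256])

    # the normalized, projected [CLS] features are the ones to use for similarity
    sim = features_image.image_embeds_proj[:, 0, :] @ features_text.text_embeds_proj[:, 0, :].t()
    print(sim)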
jhwang7628 commented 2 years ago

Hi @dxli94 ,

Thanks for getting back to me.

I am doing some testing, and it seems that the full [CLS] feature (features.image_embeds[:,0,:]) does not perform on par with the projected features (features.image_embeds_proj[:,0,:]). Using it for the video retrieval task on MSRVTT, the performance is 0.2% compared to 31%. That makes me wonder if I am doing something wrong. Any guess?

dxli94 commented 2 years ago

Hi @jhwang7628 ,

The projected features are normalized and are used to compute the contrastive loss, so they are well suited to retrieval.

The raw [CLS] feature is not normalized, so poor results are expected if you use it to compute similarities directly.
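
To make this concrete, a minimal sketch (features_image / features_text are assumed to come from extract_features as in the example above):

    import torch.nn.functional as F

    # projected features are already L2-normalized, so a dot product is a cosine similarity
    sim_proj = features_image.image_embeds_proj[:, 0, :] @ features_text.text_embeds_proj[:, 0, :].t()

    # the raw [CLS] embeddings are not normalized; if you do want to compare them,
    # L2-normalize them yourself first (note this is not what the model was trained for)
    q = F.normalize(features_image.image_embeds[:, 0, :], dim=-1)
    g = F.normalize(features_text.text_embeds[:, 0, :], dim=-1)
    sim_raw = q @ g.t()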

Thanks.