mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Questions on BLIP score computation #15

Closed by DianeBouchacourt 1 year ago

DianeBouchacourt commented 1 year ago

Hi!

First of all, thanks for the awesome library! I'm using it for a project, and it's super helpful for getting a view of the different CLIP models.

I have a question regarding the way scores are computed for BLIP: https://github.com/mertyg/vision-language-models-are-bows/blob/09e1fcffea60b7e8f31f93cb0844b344af1d0642/model_zoo/blip_models.py#L178

You seem to be using both the itm_head output and the cosine similarity between image and text embeddings. I've been looking in the Salesforce BLIP repo (https://github.com/salesforce/BLIP/tree/main/models) but I don't see anything like this. Could you explain this choice?
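For concreteness, the kind of scoring I mean looks roughly like this (a minimal sketch with my own variable names, not the repo's actual code):

```python
import torch.nn.functional as F

def combined_blip_score(itm_logits, image_feat, text_feat):
    """Sketch of the combined score I'm asking about (my naming, not the repo's)."""
    # The ITM head is a 2-way classifier; index 1 is the "image and text match" class.
    itm_prob = F.softmax(itm_logits, dim=-1)[:, 1]
    # With L2-normalized unimodal features, the dot product is the cosine similarity.
    sim = (F.normalize(image_feat, dim=-1) * F.normalize(text_feat, dim=-1)).sum(dim=-1)
    return itm_prob + sim
```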

mertyg commented 1 year ago

Hey! Thanks for the kind words 🤍 I think this is a great question!

If you look at https://github.com/salesforce/BLIP/blob/main/train_retrieval.py#L134, this is how they evaluate their models for retrieval; it just isn't in the forward function. This is also what gave us the same numbers as those reported in the original paper.
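Schematically, that evaluation does something like the following (a paraphrased sketch, not their exact code; `itm_score_fn` is a stand-in for running the multimodal encoder + ITM head on an image-text pair):

```python
import torch

@torch.no_grad()
def i2t_scores(image_embeds, text_embeds, itm_score_fn, k=128):
    """Two-stage image-to-text retrieval scoring, in the spirit of BLIP's eval."""
    # Stage 1: rank every text by unimodal feature similarity and keep the top k.
    sims = image_embeds @ text_embeds.t()      # (num_images, num_texts)
    scores = torch.full_like(sims, -100.0)     # texts outside the top k keep a low score
    for i in range(sims.size(0)):
        topk_sim, topk_idx = sims[i].topk(k)
        # Stage 2: rerank the k candidates with the ITM head; the final score
        # is the ITM "match" score plus the feature similarity from stage 1.
        itm = itm_score_fn(i, topk_idx)
        scores[i, topk_idx] = itm + topk_sim
    return scores
```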

Let me know if this clarifies the q!

DianeBouchacourt commented 1 year ago

Thanks! I thought this was what they did when performing retrieval on the entire dataset; they even state in their paper:

> To enable faster inference speed, we follow Li et al. (2021a) and first select k candidates based on the image-text feature similarity, and then rerank the selected candidates based on their pairwise ITM scores.

So did you adapt it to your case (which can be seen as retrieval over 2 text captions, rather than over the entire dataset of captions)?

mertyg commented 1 year ago

Yes, I agree. It was confusing to us too, since they do not state in the paper that the final stage uses a combined score of ITM + feature similarity rather than the ITM score alone.

In the end, we chose to use the code they released. AFAIR the performance should not differ much between the two choices, though I don't have exact numbers at hand. What I do know is that this scoring reproduces the retrieval numbers reported in their paper.

On adaptation: we have two modes of operation. If you use run_scores_dataset, it operates in the "retrieval over the dataset" mode, which is the same behavior described in the snippet you shared from the BLIP paper. We use this mode to evaluate on COCO/Flickr30k; main_retrieval.py uses this function.

run_scores_batched uses the "retrieval over K captions" mode ("_batched" because we dub the K captions a batch). main_aro.py uses this function, and that script evaluates the ARO datasets. So yes, we treated it as retrieval over 2 text captions; see the sketch below.
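To make the batched mode concrete, here is a schematic sketch (not the actual implementation) of scoring K candidate captions per image with the same combined score:

```python
import torch

@torch.no_grad()
def batched_caption_choice(image_embeds, caption_embeds, itm_scores):
    """Pick the best of K candidate captions per image.

    image_embeds:   (N, D)    normalized image features
    caption_embeds: (N, K, D) normalized features of K candidate captions per image
    itm_scores:     (N, K)    ITM "match" scores for each (image, caption) pair
    """
    # Cosine similarity between each image and each of its K candidate captions.
    sims = torch.einsum("nd,nkd->nk", image_embeds, caption_embeds)
    # Same combined score as above; the highest-scoring caption wins.
    return (sims + itm_scores).argmax(dim=-1)
```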