mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Question regarding numbers in Figure 1 #36

Closed YunYunY closed 5 months ago

YunYunY commented 5 months ago

Dear authors, thank you for sharing the data and code.

According to the paper's description (bottom of page 3), there are only two choices, so the probabilities should sum to 1 (100%). How did you get the numbers 81% and 78% in Figure 1? Could you please clarify the calculation here? Thank you.

vinid commented 5 months ago

Hello!

BLIP uses an image-text matching loss (see https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) to predict a matching score between an image and a text.

This is a binary classifier applied to each image-text pair separately, which is why we get scores that do not sum to one: each pair gets its own match probability, independent of the other caption.
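
To make the difference concrete, here is a minimal sketch in plain Python. The logits are made up for illustration and are not the values behind Figure 1; the names `itm_logits` and `contrastive_scores` are hypothetical and do not come from the repository. It assumes the matching head emits a (no match, match) logit pair per caption, and contrasts that with a softmax taken across the two captions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a sequence of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical matching-head outputs: one (no_match, match) logit pair
# per caption. Each pair is normalized on its own.
itm_logits = {
    "caption_a": (-0.7, 0.7),
    "caption_b": (-0.6, 0.6),
}

# Probability of "match" for each image-text pair, computed independently.
match_prob = {name: softmax(pair)[1] for name, pair in itm_logits.items()}
total_independent = sum(match_prob.values())

# For contrast: a single softmax across the two captions, which is the
# setting where the two numbers are forced to sum to 1.
contrastive_scores = [2.0, 1.5]  # made-up similarity scores
across_captions = softmax(contrastive_scores)

print(match_prob, total_independent)
print(across_captions, sum(across_captions))
```

In the first case nothing ties the two match probabilities together, so both can be high at once. Only the second case normalizes over the captions.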