YunYunY closed this issue 5 months ago
Hello!
BLIP uses an image-text matching (ITM) loss (see https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) to predict a matching score between an image and a text.
The ITM head is a binary classifier applied to each (image, text) pair independently, which is why the scores do not sum to one: you get a separate match probability for each pair.
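To make this concrete, here is a minimal sketch (not the actual BLIP code) of how an ITM-style binary head behaves. The logits below are made up for illustration and chosen so that the match scores land near 0.81 and 0.78: within one pair the two classes (no-match / match) sum to 1, but the match scores of *different* pairs are independent and need not sum to anything in particular.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical ITM-head logits for two independent (image, text) pairs.
# Each row is [no-match logit, match logit] for one pair.
itm_logits = np.array([
    [-0.70, 0.75],  # pair 1
    [-0.60, 0.67],  # pair 2
])

probs = softmax(itm_logits)
match_scores = probs[:, 1]  # probability of the "match" class, per pair

# Within one pair the two class probabilities sum to 1 ...
print(probs.sum(axis=-1))   # [1. 1.]
# ... but the match scores of different pairs are unrelated
# and can be e.g. ~0.81 and ~0.78 at the same time.
print(match_scores)
```

So two different pairs can both score high (or both low); the "sums to 1" constraint only holds across the two classes of a single pair, not across pairs.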
Dear authors, thank you for sharing the data and code.
According to the paper's description (bottom of page 3), there are only two choices, so the probabilities should sum to 1 (100%). How did you get the numbers 81% and 78% in Figure 1? Could you please clarify the calculation? Thank you.