microsoft / FIBER

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
MIT License

Very negative values for matched image-text pairs from the coarse-grained ITM head #19

Open LilyDaytoy opened 10 months ago

LilyDaytoy commented 10 months ago

Hi! Thanks for this wonderful work! I tried to evaluate on the Flickr30k test set using your coarse-grained ITM approach. I got the score matrix for all image-text pairs from this function: https://github.com/microsoft/FIBER/blob/ca0f36bd7e1ad0ac02af2550042b1f259adaf5f9/coarse_grained/fiber/modules/objectives.py#L389 But I found that the score computed for a matched image-text pair is very negative, for example: score = -6.652344

[image attachment]

or very small: score = 0.048279

[image attachment]

This seems quite weird. For example: caption1 = "a black boy in orange and white trucks on playing in the sand", caption2 = "the white dog is running in the shallow water", and the image is

[image attachment]

This image clearly matches caption1, but the score for caption1 is -4.5 and the score for caption2 is -2.4, so the image ends up matching caption2 more, since score2 is less negative?

I would like to ask: is it because I computed the score wrongly, or is this a normal score? Or do I need to do some further processing on the score matrix?
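For reference, the retrieval evaluation only depends on the per-row ranking of the score matrix, not on the sign or magnitude of the raw scores. A minimal sketch (toy 3x3 matrix, assuming ground-truth pairs sit on the diagonal, which is not necessarily how FIBER's evaluation indexes them):

```python
import torch

# Hypothetical [num_images x num_texts] score matrix with ground-truth
# pairs on the diagonal. All scores are negative, yet ranking still works.
scores = torch.tensor([
    [-4.5, -6.0, -7.2],
    [-5.1, -2.4, -3.3],
    [-8.0, -6.6, -1.9],
])

# Text retrieval Recall@1: fraction of images whose top-ranked text is correct.
top1 = scores.argmax(dim=1)
recall_at_1 = (top1 == torch.arange(scores.size(0))).float().mean()
print(recall_at_1.item())  # 1.0 for this toy matrix
```

So a negative score for a matched pair is not by itself a problem; what matters is whether the matched caption outranks the unmatched ones, which in the example above it does not.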

zdou0830 commented 10 months ago

Hi, thanks for the question! Can you reproduce the evaluation results? The logits are fed to a normalization layer during training, so it can be hard to tell whether they make sense just by looking at the raw values: https://github.com/microsoft/FIBER/blob/ca0f36bd7e1ad0ac02af2550042b1f259adaf5f9/coarse_grained/fiber/modules/objectives.py#L61C24-L61C24
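As a toy illustration of why a raw logit is hard to interpret in isolation (assuming a standard two-class match/no-match ITM head with a softmax normalization, which may differ from FIBER's exact head):

```python
import torch
import torch.nn.functional as F

# Hypothetical two-class ITM logits (index 0 = not matched, index 1 = matched)
# for two image-text pairs. A very negative "matched" logit can still yield a
# high match probability if the "not matched" logit is even more negative.
logits = torch.tensor([
    [-9.0, -4.5],   # pair 1
    [-1.0, -2.4],   # pair 2
])

probs = F.softmax(logits, dim=-1)  # normalize per pair
match_prob = probs[:, 1]
print(match_prob)  # pair 1 wins despite the more negative raw "matched" logit
```

Under this view, only the difference between the two logits of a pair carries meaning, so comparing single raw values across pairs can be misleading.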