mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

dataset size of flickr and coco order datasets #20

Closed HarmanDotpy closed 1 year ago

HarmanDotpy commented 1 year ago

Hi, in the paper (Fig. 1) it says there are 6,000 test cases in the Flickr and COCO order datasets. However, when we download and create the data, we get 25,010 and 5,000 datapoints for COCO and Flickr, respectively. These numbers make sense to me, since the two test sets contain 25,010 and 5,000 captions in total, so perturbing each caption would give exactly that many datapoints. So I was wondering about the 6,000 written in the paper.

thank you!

mertyg commented 1 year ago

You are right here, I believe; the difference is that there are 5,000 images for COCO and 1,000 images for Flickr, with 5 captions each.

That makes 30k test cases in total (since each caption is a separate test case); for some reason we mistakenly put the number of images in the figure.
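A quick sanity check of the arithmetic, assuming 5 captions per image as above (the extra 10 COCO cases reported in the issue presumably come from a handful of images that have more than 5 captions):

```python
# Test-case counts implied by the numbers above:
# each caption is a separate test case, and every image has 5 captions.
coco_images, flickr_images = 5000, 1000
captions_per_image = 5

coco_cases = coco_images * captions_per_image      # 25,000
flickr_cases = flickr_images * captions_per_image  # 5,000
total_cases = coco_cases + flickr_cases            # 30,000

print(coco_cases, flickr_cases, total_cases)
```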

HarmanDotpy commented 1 year ago

I see, thanks. So the results of the COCO/Flickr order PRC task are on this dataset only, is that correct? I am somehow getting the following results:

Flickr order: NegCLIP released model: 0.70, CLIP ViT-B/32: 0.48, CLIP ViT-B/32 finetuned on COCO: 0.35

COCO order: NegCLIP released model: 0.66, CLIP ViT-B/32: 0.38, CLIP ViT-B/32 finetuned on COCO: 0.26

I am basically computing the average Recall@1 for image-to-text retrieval, where the candidate texts are the 5 texts corresponding to the image (1 positive text, the rest order-shuffled). The relative scores (say, CLIP ViT-B/32 vs. CLIP ViT-B/32 finetuned on COCO) are similar to those reported in the paper, but the absolute scores are different. I'm wondering if you have an idea why. I think my eval is correct, since I am using a similar eval for other datasets, but I will recheck that as well.
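For concreteness, here is a minimal sketch of the eval I am describing (variable and function names are mine, not from the repo); it assumes the positive caption sits at index 0 of each candidate set:

```python
import numpy as np

def order_task_accuracy(similarity):
    """similarity: (n_images, 5) array of image-text scores, where
    column 0 is the true caption and columns 1-4 are order-shuffled
    negatives. Returns the fraction of images whose true caption
    received the top score (i.e. Recall@1 over 5 candidates)."""
    preds = similarity.argmax(axis=1)   # index of the highest-scoring caption
    return float((preds == 0).mean())   # correct when the positive (column 0) wins

# toy usage with random scores for 3 images
rng = np.random.default_rng(0)
sims = rng.random((3, 5))
print(order_task_accuracy(sims))
```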

thanks!

mertyg commented 1 year ago

Yes, it is on this dataset only.

I'm not sure about the difference in absolute value. I can try to take a look at your code, if you are using something else, to see if we spot any difference (or ChatGPT is likely better than me at that 😄).

HarmanDotpy commented 1 year ago

OK, I finally know the reason for the non-reproducibility on my side; putting it here for anyone else in the future.

Say we have an image in the order datasets; for it we will have 5 captions (1 positive, all others negative). We calculate the scores for this image with all 5 captions and get an array, e.g. scores = [a=0.3, b=0.32, c=0.33, d=0.31, e=0.3].

There are many cases in the dataset where two scores are equal (e.g. a = e above). When the positive's score ties with a negative's and the tied scores are also the maximum, argmax(scores) returns the first index, i.e. [0]. However, I was taking a more complicated route for this calculation and eventually getting [4] as the argmax (which is also a valid argmax!).
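A tiny illustration of that tie behavior (the scores below are adapted from the example above so that the tie sits at the maximum):

```python
import numpy as np

# np.argmax returns the *first* index among tied maxima, so when the
# positive caption (index 0) ties with a negative for the top score,
# argmax counts the case as correct; a routine that returns the *last*
# tied index (4 here) counts the same case as wrong.
scores = np.array([0.33, 0.32, 0.30, 0.31, 0.33])  # positive at index 0 ties with index 4
print(np.argmax(scores))                           # -> 0 (first tied maximum)
print(len(scores) - 1 - np.argmax(scores[::-1]))   # -> 4 (last tied maximum)
```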

I am not sure whether argmax is the correct thing to do. Ideally, we should penalize the model when it assigns the same score to the positive and a negative sentence; however, simply using argmax (as also done in this repo for calculating results) gives the same results as in the paper.
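For anyone who wants to make the tie handling explicit, here is a minimal sketch (my own code, not from the repo) comparing the lenient argmax-based accuracy with a strict variant that penalizes ties:

```python
import numpy as np

def lenient_accuracy(similarity):
    """Counts a test case correct if argmax lands on the positive
    (index 0); ties at the maximum resolve in the positive's favor,
    since np.argmax returns the first tied index."""
    return float((similarity.argmax(axis=1) == 0).mean())

def strict_accuracy(similarity):
    """Counts a test case correct only if the positive strictly
    beats every negative, so ties are penalized."""
    pos = similarity[:, 0]
    neg_max = similarity[:, 1:].max(axis=1)
    return float((pos > neg_max).mean())

sims = np.array([[0.33, 0.32, 0.30, 0.31, 0.33],   # tie between positive and a negative
                 [0.40, 0.20, 0.10, 0.30, 0.25]])  # clear win for the positive
print(lenient_accuracy(sims))  # 1.0
print(strict_accuracy(sims))   # 0.5
```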