mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Table 6, COCO and Flickr Image/Text R@1 results #18

Closed · HarmanDotpy closed this issue 1 year ago

HarmanDotpy commented 1 year ago

Hi,

As per one of my previously opened issues, the Table 6 results are numbers obtained after linear probing on the corresponding datasets (ImageNet/CIFAR/COCO/Flickr).

In the previous issue, this comment https://github.com/mertyg/vision-language-models-are-bows/issues/4#issuecomment-1442634779 described the linear probe structure for image classification.

I wanted to know how you fine-tune the model for COCO/Flickr. Do you have extra linear heads over the image/text encoders that you fine-tune, or is the whole model fine-tuned for these two datasets? And is there a standard way in the literature that people use for measuring downstream task performance, as done in Table 6?

thanks!

vinid commented 1 year ago

Hello!

Linear classification is only for CIFAR and ImageNet. COCO and Flickr are still used for retrieval.

(See also Section 4, "evaluation" paragraph in the paper)

Sorry for the confusion!

HarmanDotpy commented 1 year ago

So does this mean that for classification tasks linear probing is used, while for retrieval tasks zero-shot results are reported? Is this understanding correct?

mertyg commented 1 year ago

Yes, that is correct! We followed the same protocol as in the original CLIP paper. Sorry for the confusion.
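
For concreteness, here is a minimal sketch of what that linear-probe protocol typically looks like (frozen CLIP image features + logistic regression, as in the CLIP paper). The backbone, dataset, and regularization value below are illustrative assumptions, not the repo's exact evaluation script:

```python
# Minimal linear-probe sketch: frozen CLIP image features + logistic regression.
# Assumptions: OpenAI `clip` package, torchvision CIFAR-10, ViT-B/32 backbone.
import clip
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(dataset):
    """Encode images with the frozen CLIP image encoder."""
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=256):
            f = model.encode_image(images.to(device)).float()
            feats.append(f.cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

train = CIFAR10(root="./data", train=True, download=True, transform=preprocess)
test = CIFAR10(root="./data", train=False, download=True, transform=preprocess)
X_train, y_train = extract_features(train)
X_test, y_test = extract_features(test)

# Linear probe: only this classifier is trained; the encoders stay frozen.
probe = LogisticRegression(max_iter=1000, C=0.316)  # C would normally be swept
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```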

HarmanDotpy commented 1 year ago

Thanks, I had a couple more questions:

  1. How is the image-to-text retrieval score calculated? In the COCO/Flickr test sets there are 5 captions per image. One convention I have seen in some papers is that, when computing Recall@1, the model gets credit if the top retrieved sentence is any one of the 5 positive sentences (a sketch of this convention is below, after this list). But I think this paper uses some other convention, since the text R@1 scores are ~0.5 for CLIP, while (if I'm not wrong) they should probably be higher under the method I described above.
  2. Is there a particular reason for fine-tuning for image classification vs. zero-shot for retrieval in Table 6? Is it motivated by previous literature / the way it is done in previous papers?
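
For question 1, this is the convention I mean: an image counts as a hit at Recall@K if any of its ground-truth captions appears in the top-K retrieved captions. The sketch below uses precomputed, L2-normalized embeddings as a stand-in and is only an illustration, not this repo's evaluation code:

```python
# Image-to-text Recall@K where a retrieval is a hit if the top-K captions
# contain ANY of the image's ground-truth captions (5 per image in COCO/Flickr).
import numpy as np

def image_to_text_recall_at_k(image_embs, text_embs, caption_image_ids, k=1):
    """
    image_embs: (N_images, D) normalized image embeddings
    text_embs: (N_captions, D) normalized caption embeddings
    caption_image_ids: (N_captions,) index of the image each caption belongs to
    """
    sims = image_embs @ text_embs.T          # (N_images, N_captions) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # top-k caption indices per image
    hits = 0
    for img_idx, retrieved in enumerate(topk):
        # hit if any retrieved caption belongs to this image
        if np.any(caption_image_ids[retrieved] == img_idx):
            hits += 1
    return hits / image_embs.shape[0]

# Toy example: 2 images, 5 captions each, random embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(2, 8)); image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
text_embs = rng.normal(size=(10, 8)); text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
caption_image_ids = np.repeat(np.arange(2), 5)  # captions 0-4 -> image 0, 5-9 -> image 1
print(image_to_text_recall_at_k(image_embs, text_embs, caption_image_ids, k=1))
```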

thanks!

mertyg commented 1 year ago

I think all of these are standard; retrieval and linear-probing evaluations are performed similarly across the literature. You can refer to one of the VLM papers, e.g. the CLIP paper Section 3.2, Appendix A, Appendix E, or the original retrieval dataset papers. You can also cross-compare the numbers reported in our paper with the original CLIP/BLIP/X-VLM papers if that helps.

HarmanDotpy commented 1 year ago

Thanks for the references. Looking at the CLIP scores for Flickr and COCO in this paper:

[Screenshot: CLIP retrieval scores for Flickr/COCO reported in this paper]

vs. the scores reported in the CLIP paper:

[Screenshot: zero-shot retrieval scores reported in the CLIP paper]

there seems to be some difference in these zero-shot results: 0.59, 0.78, 0.30, 0.50 vs. 68.7, 88, 37.8, 58.4.

HarmanDotpy commented 1 year ago

oh, I guess the architectures are different...

mertyg commented 1 year ago

Yup