Closed: HarmanDotpy closed this issue 1 year ago
Hello!
Linear classification is only for CIFAR and ImageNet. COCO and Flickr are still used for retrieval.
(See also Section 4, "evaluation" paragraph in the paper)
Sorry for the confusion!
So does this mean that for classification tasks, linear probing is used, while for retrieval tasks, zero-shot results are reported? Is this understanding correct?
Yes, that is correct! We followed the same protocol as in the original CLIP paper. Sorry for the confusion.
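For concreteness, here is a minimal sketch of what the linear-probe protocol looks like (following CLIP paper Appendix A: frozen image features + logistic regression). This is illustrative only, not the repo's exact script; the data loaders and the regularization value are assumptions:

```python
# Linear-probe sketch: extract frozen CLIP image features,
# then fit a logistic-regression classifier on top of them.
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(loader):
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = model.encode_image(images.to(device))
            feats.append(f.cpu())
            labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# train_loader / test_loader: standard CIFAR/ImageNet loaders (assumed to exist)
X_train, y_train = extract_features(train_loader)
X_test, y_test = extract_features(test_loader)

probe = LogisticRegression(max_iter=1000, C=0.316)  # C is typically swept on a validation split
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```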
Thanks, I had a couple more questions.
thanks!
I think all of these are standard; e.g. retrieval and linear-probing evaluations are performed similarly everywhere. You can easily refer to one of the VLM papers, e.g. see the CLIP paper Section 3.2, Appendix A, Appendix E, or the original retrieval dataset papers. You can also cross-compare the numbers reported in the paper with the original CLIP/BLIP/X-VLM papers if that helps.
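For reference, a rough sketch of the standard zero-shot retrieval metric (Recall@K) used for COCO/Flickr: embed all images and captions with the frozen model, rank by similarity, and check whether a ground-truth match appears in the top K. The tensor names below are made up for illustration:

```python
# Zero-shot text-to-image retrieval Recall@K with precomputed embeddings.
import torch

def recall_at_k(image_feats, text_feats, text_to_image, k=1):
    # image_feats: [N_img, D], text_feats: [N_txt, D], both L2-normalized
    # text_to_image[i] = index of the image that caption i describes
    sims = text_feats @ image_feats.t()              # [N_txt, N_img] cosine similarities
    topk = sims.topk(k, dim=1).indices               # top-K retrieved images per caption
    correct = (topk == text_to_image.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

# Example usage (img_emb, txt_emb, gt_index assumed precomputed):
# r1 = recall_at_k(img_emb, txt_emb, gt_index, k=1)
# r5 = recall_at_k(img_emb, txt_emb, gt_index, k=5)
```

Image-to-text retrieval is the same computation with the roles of images and captions swapped.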
Thanks for the references. Comparing the CLIP scores for Flickr and COCO in this paper vs. the scores reported in the CLIP paper, there seems to be some difference in the zero-shot results: 0.59, 0.78, 0.30, 0.50 vs. 68.7, 88, 37.8, 58.4.
Oh, I guess the architectures are different...
Yup
Hi,
As per one of my previously opened issues, the Table 6 results are numbers obtained after linear probing on the corresponding datasets (ImageNet/CIFAR/COCO/Flickr).
In the previous issue, this comment https://github.com/mertyg/vision-language-models-are-bows/issues/4#issuecomment-1442634779 mentioned the linear probe structure for image classification.
I wanted to know how you finetune the model for COCO/Flickr. Do you have extra linear heads over the image/text encoders that you finetune, or is the whole model finetuned for these two datasets? And is there a standard way in the literature for measuring downstream task performance, as done in Table 6?
thanks!