mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Questions on evaluation results #33

Open ytaek-oh opened 10 months ago

ytaek-oh commented 10 months ago

Dear Authors,

First, thank you for this outstanding project. I've been experimenting with your benchmark using the provided codebase, and I noticed some discrepancies between the evaluation results I obtained and those reported in your paper.
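For context, my evaluation loop looks roughly like the sketch below. This is a minimal sketch, assuming the `dataset_zoo` classes and the item format (`image_options` / `caption_options`) described in the repository README, plus the standard OpenAI `clip` package; the true-caption index and the `root_dir` path are placeholders of mine and may not match the actual scripts.

```python
import torch
import clip  # OpenAI CLIP package, used here for the ViT-B-32 "openai" row

# Assumed from the repository README: dataset classes whose items look like
# {"image_options": [image_tensor], "caption_options": [cap_0, ..., cap_k]}.
# Exact class names / constructor arguments may differ.
from dataset_zoo import COCO_Order

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

dataset = COCO_Order(image_preprocess=preprocess, root_dir="/path/to/aro")  # placeholder path

correct = 0
for item in dataset:
    image = item["image_options"][0].unsqueeze(0).to(device)
    captions = item["caption_options"]
    tokens = clip.tokenize(captions).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(tokens)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = img_feat @ txt_feat.T  # similarity of the image to each caption option
    # Assumption: the true caption sits at index 0 of caption_options;
    # accuracy is the fraction of items where it receives the highest similarity.
    correct += int(sims.argmax(dim=-1).item() == 0)

print(f"COCO_Order accuracy: {correct / len(dataset):.3f}")
```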

Below is an overview of my evaluation results on the VG tasks and the PRC (order) tasks using COCO_Order and Flickr30k_Order:

| model | pretrained | vg_relation | vg_attribution | coco_order | flickr30k_order | Task Avg. |
|---|---|---|---|---|---|---|
| ViT-B-32 | openai | 59.9% | 63.2% | 47.4% | 58.8% | 57.3% |
| NegCLIP | coco ft | 80.2% | 70.5% | 86.8% | 89.7% | 81.8% |
| BLIP-base | flickr ft | 49.7% | 89.9% | 42.5% | 40.5% | 55.7% |
| BLIP-base | coco ft | 58.4% | 89.5% | 37.1% | 46.3% | 57.8% |

My reproduced results for VG_Relation and VG_Attribution closely align with the numbers presented in your paper. However, I have concerns about the NegCLIP result on flickr30k_order, for which 91% (0.91) is reported in your paper (Appendix Table 6).

In addition, the discrepancy for the BLIP models seems somewhat larger. Appendix Table 5 of your paper reports 0.369 for Flickr30k-PRC (BLIP-flickr-base) and 0.321 for COCO-PRC (BLIP-coco-base). In contrast, my results for the same models are noticeably higher: 40.5% and 37.1%, respectively.

Note 1: I observed that some randomness arises when the order annotations are created from the original annotation file; however, it does not seem large enough to explain the gap observed.

Note 2: To control for this randomness, I kept the same order annotations across all models in my experiments (see the sketch below).
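Concretely, the pinning in Note 2 looked roughly like the sketch below. The helper name, the cache path, and the assumption that the perturbed captions are produced via Python's / NumPy's global RNG at dataset-construction time are mine, not the repository's:

```python
import json
import random
import numpy as np

def build_order_dataset_with_fixed_negatives(dataset_cls, preprocess, root_dir,
                                              seed=0, cache_path="order_annotations.json"):
    """Hypothetical helper: pin the RNG before the dataset builds its shuffled captions,
    then dump the resulting caption options so every model sees the same negatives."""
    random.seed(seed)
    np.random.seed(seed)
    dataset = dataset_cls(image_preprocess=preprocess, root_dir=root_dir)

    # Save the (true caption, shuffled captions) tuples once, so runs with different
    # models can be checked against the exact same order annotations.
    # (Iterating also loads images, which is slow but fine for a one-off dump.)
    annotations = [item["caption_options"] for item in dataset]
    with open(cache_path, "w") as f:
        json.dump(annotations, f, indent=2)
    return dataset
```

With this in place, swapping the model checkpoint does not change the negatives, so the residual randomness from Note 1 should not be the source of the gap.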

Given that these results were obtained with the provided code and checkpoints, I would appreciate any pointers to mistakes I may have made or to anything else that could explain the discrepancy.

Best regards,

Gavin001201 commented 9 months ago

I got similar results to yours.