Dear Authors,

First of all, thank you for your outstanding project. I've been running your benchmark with the provided codebase and noticed some discrepancies between the evaluation results I obtained and those reported in your paper.
Below is an overview table comparing my evaluation results on the VG datasets and the PRC tasks (COCO_order and Flickr30k_order):
| model | pretrained | vg_relation | vg_attribution | coco_order | flickr30k_order | Task Avg. |
|---|---|---|---|---|---|---|
| ViT-B-32 | openai | 59.9% | 63.2% | 47.4% | 58.8% | 57.3% |
| NegCLIP | coco ft | 80.2% | 70.5% | 86.8% | 89.7% | 81.8% |
| BLIP-base | flickr ft | 49.7% | 89.9% | 42.5% | 40.5% | 55.7% |
| BLIP-base | coco ft | 58.4% | 89.5% | 37.1% | 46.3% | 57.8% |
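For reference, this is the scoring rule I assumed when computing the coco_order and flickr30k_order numbers above: an image counts as correct only if the original caption outscores every shuffled variant. A minimal sketch of that check (function and variable names are illustrative, not taken from your codebase):

```python
import numpy as np

def order_task_accuracy(scores):
    """scores: array of shape (num_images, num_captions), where column 0 holds the
    score of the original caption and the remaining columns hold the scores of its
    shuffled variants. An image counts as correct only when the original caption
    receives the highest score."""
    scores = np.asarray(scores)
    correct = scores.argmax(axis=1) == 0
    return float(correct.mean())

# Usage (hypothetical): `model_scores` comes from the provided evaluation code.
# acc = order_task_accuracy(model_scores)
```

If the released evaluation aggregates differently (e.g., per caption rather than per image, or with a different tie-breaking rule), please let me know, as that could explain part of the gap.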
My reproduced results for VG_Relation and VG_Attribution closely match the numbers presented in your paper. However, I have a question about the NegCLIP result on flickr30k_order: your paper reports 91% (0.91) in Appendix Table 6, whereas I obtained 89.7%.
The discrepancy is larger for the BLIP models. Appendix Table 5 of your paper reports 0.369 for Flickr30k-PRC (BLIP-flickr-base) and 0.321 for COCO-PRC (BLIP-coco-base), whereas I obtained noticeably higher scores of 40.5% and 37.1%, respectively, for the same models.
Note 1: Some randomness arises when the order annotations are created from the original annotation file, but it does not seem large enough to explain the gap above.
Note 2: To control for this randomness, I used the same order annotations for every model in my experiments (see the sketch below).
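Concretely, I generate the order annotations once with a fixed seed, cache them to disk, and load the same cache for every model. The sketch below uses a plain word shuffle and an illustrative file layout just to show the caching idea; the actual perturbation logic from your code was left unchanged.

```python
import json
import random

SEED = 0  # arbitrary fixed seed; the value is my choice, not from your code


def build_or_load_order_annotations(original_ann_path, cache_path, num_perturbations=4):
    """Create the shuffled-caption ("order") annotations once and reuse the cached
    file for every model, so all models are evaluated on identical perturbations.
    The annotation fields below are illustrative and may differ from your layout."""
    try:
        with open(cache_path) as f:
            return json.load(f)
    except FileNotFoundError:
        pass

    rng = random.Random(SEED)
    with open(original_ann_path) as f:
        annotations = json.load(f)

    order_annotations = []
    for ann in annotations:
        words = ann["caption"].split()
        shuffled = []
        for _ in range(num_perturbations):
            perm = words[:]
            rng.shuffle(perm)
            shuffled.append(" ".join(perm))
        order_annotations.append(
            {"image": ann["image"], "caption": ann["caption"], "shuffled_captions": shuffled}
        )

    with open(cache_path, "w") as f:
        json.dump(order_annotations, f)
    return order_annotations
```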
Since these results were obtained with the provided code and checkpoints, I would appreciate any pointers to mistakes on my side or other possible causes of the discrepancy.
Best regards,