Dear Authors,

First of all, thank you for your outstanding project. I've been running your benchmark with the provided codebase and noticed some discrepancies between the evaluation results I obtained and those reported in your paper.
Below is an overview table comparing my evaluation results on the VG datasets and the PRC tasks (COCO_order and Flickr30k_order):
| model | pretrained | vg_relation | vg_attribution | coco_order | flickr30k_order | Task Avg. |
|---|---|---|---|---|---|---|
| ViT-B-32 | openai | 59.9% | 63.2% | 47.4% | 58.8% | 57.3% |
| NegCLIP | coco ft | 80.2% | 70.5% | 86.8% | 89.7% | 81.8% |
| BLIP-base | flickr ft | 49.7% | 89.9% | 42.5% | 40.5% | 55.7% |
| BLIP-base | coco ft | 58.4% | 89.5% | 37.1% | 46.3% | 57.8% |
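For reference, this is the scoring rule I assumed when computing the coco_order and flickr30k_order numbers above: an image counts as correct only if the original caption outscores every shuffled variant. A minimal sketch of that check (function and variable names are illustrative, not taken from your codebase):

```python
import numpy as np

def order_task_accuracy(scores):
    """scores: array of shape (num_images, num_captions), where column 0 holds the
    score of the original caption and the remaining columns hold the scores of its
    shuffled variants. An image counts as correct only when the original caption
    receives the highest score."""
    scores = np.asarray(scores)
    correct = scores.argmax(axis=1) == 0
    return float(correct.mean())

# Usage (hypothetical): `model_scores` comes from the provided evaluation code.
# acc = order_task_accuracy(model_scores)
```

If the released evaluation aggregates differently (e.g., per caption rather than per image, or with a different tie-breaking rule), please let me know, as that could explain part of the gap.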
My reproduced results for VG_Relation and VG_Attribution closely match the numbers presented in your paper. However, I have a question about the NegCLIP result on flickr30k_order: your paper reports 91% (0.91) in Appendix Table 6, whereas I obtained 89.7%.
The discrepancy is larger for the BLIP models. Appendix Table 5 of your paper reports 0.369 for Flickr30k-PRC (BLIP-flickr-base) and 0.321 for COCO-PRC (BLIP-coco-base), whereas I obtained noticeably higher scores of 40.5% and 37.1%, respectively, for the same models.
Note 1: Some randomness arises when the order annotations are created from the original annotation file, but it does not seem large enough to explain the gap above.
Note 2: To control for this randomness, I used the same order annotations for every model in my experiments (see the sketch below).
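Concretely, I generate the order annotations once with a fixed seed, cache them to disk, and load the same cache for every model. The sketch below uses a plain word shuffle and an illustrative file layout just to show the caching idea; the actual perturbation logic from your code was left unchanged.

```python
import json
import random

SEED = 0  # arbitrary fixed seed; the value is my choice, not from your code


def build_or_load_order_annotations(original_ann_path, cache_path, num_perturbations=4):
    """Create the shuffled-caption ("order") annotations once and reuse the cached
    file for every model, so all models are evaluated on identical perturbations.
    The annotation fields below are illustrative and may differ from your layout."""
    try:
        with open(cache_path) as f:
            return json.load(f)
    except FileNotFoundError:
        pass

    rng = random.Random(SEED)
    with open(original_ann_path) as f:
        annotations = json.load(f)

    order_annotations = []
    for ann in annotations:
        words = ann["caption"].split()
        shuffled = []
        for _ in range(num_perturbations):
            perm = words[:]
            rng.shuffle(perm)
            shuffled.append(" ".join(perm))
        order_annotations.append(
            {"image": ann["image"], "caption": ann["caption"], "shuffled_captions": shuffled}
        )

    with open(cache_path, "w") as f:
        json.dump(order_annotations, f)
    return order_annotations
```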
Since these results were obtained with the provided code and checkpoints, I would appreciate any pointers to mistakes on my side or other possible causes of the discrepancy.
Best regards,