tsb0601 / MMVP


Discrepancy between the code and Table 1 in the paper #5

Open · MajorDavidZhang opened this issue 5 months ago

MajorDavidZhang commented 5 months ago

Hi, thanks for your insightful work! I am using your MMVP benchmark to test the performance of different CLIP models. However, when I run the exact code from evaluate_vlm.py, I cannot reproduce the results in Table 1 of the paper. My results are:

| Category | Score |
| --- | --- |
| Orientation and Direction | 26.7 |
| Presence of Specific Features | 13.3 |
| State and Condition | 26.7 |
| Quantity and Count | 6.7 |
| Positional and Relational Context | 6.7 |
| Color and Appearance | 40 |
| Structural and Physical Characteristics | 26.7 |
| Texts | 13.3 |
| Viewpoint and Perspective | 20 |

These numbers differ from the first row of Table 1 in the paper, and in fact from every row of Table 1. Could you confirm this? Thanks very much!
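For context, here is how I understand the scoring (the helper names and similarity layout below are my own illustration, not the repo's actual code): if each category contains 15 image/text pairs and a pair only counts as correct when both images prefer their own caption, then every category score is a multiple of 100/15 ≈ 6.7, which matches the numbers above.

```python
# Sketch of pair-level scoring as I understand it; helper names and the
# similarity layout are illustrative, not the repo's actual code.
from typing import List, Tuple

Pair = Tuple[Tuple[float, float], Tuple[float, float]]  # sims[i][j]: image i vs. caption j

def pair_correct(sims: Pair) -> bool:
    (s00, s01), (s10, s11) = sims
    # The pair counts only if image 0 prefers caption 0 AND image 1 prefers caption 1.
    return s00 > s01 and s11 > s10

def category_score(pairs: List[Pair]) -> float:
    """Percentage of correctly matched pairs in one visual-pattern category."""
    return 100.0 * sum(pair_correct(p) for p in pairs) / len(pairs)

# With 15 pairs per category, 4 correct pairs gives 100 * 4 / 15 ≈ 26.7,
# so every score lands on a multiple of ~6.7.
```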

lst627 commented 4 months ago

I could not get the same results either. Repeating the same command multiple times also yielded different results for 'Positional and Relational Context', 'Structural and Physical Characteristics', and 'Orientation and Direction', while the other categories stayed the same across runs.
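If the run-to-run variation comes from non-deterministic GPU kernels rather than from the benchmark itself, pinning the seeds and forcing deterministic algorithms before running evaluate_vlm.py is a quick way to check. This is generic PyTorch setup, not something the repo currently does; a minimal sketch:

```python
# Generic reproducibility setup for a PyTorch/CLIP evaluation run; a sketch,
# not part of the MMVP repo. Run this before building the model and data loader.
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by some CUDA ops when deterministic algorithms are enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Raise an error if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)

seed_everything(0)
```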

pavank-apple commented 4 weeks ago

I am seeing a similar issue: I cannot reproduce the results in Table 1 even when running the exact code in the repo. @tsb0601 any advice? Unlike @lst627, I get consistent results across multiple runs, but they are consistently worse than the reported numbers.
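To make the setups easier to compare, it may help if each reproduction attempt comes with an environment report; a minimal sketch using only standard torch/stdlib calls (nothing repo-specific):

```python
# Minimal environment report to attach alongside the scores; a sketch, not part of the repo.
import platform
import sys

import torch

print("python   :", sys.version.split()[0])
print("platform :", platform.platform())
print("torch    :", torch.__version__)
print("cuda     :", torch.version.cuda)
print("cudnn    :", torch.backends.cudnn.version())
print("gpu      :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
```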