tsb0601 / MMVP


Discrepancy between the code and Table 1 in the paper #5

Open · MajorDavidZhang opened this issue 5 months ago

MajorDavidZhang commented 5 months ago

Hi, thanks for your insightful work! I am using your MMVP benchmark to test the performance of different CLIP models. However, when I run the exact code from evaluate_vlm.py, I cannot reproduce the results in Table 1 of the paper. My results are:

| Category | Score |
| --- | --- |
| Orientation and Direction | 26.7 |
| Presence of Specific Features | 13.3 |
| State and Condition | 26.7 |
| Quantity and Count | 6.7 |
| Positional and Relational Context | 6.7 |
| Color and Appearance | 40 |
| Structural and Physical Characteristics | 26.7 |
| Texts | 13.3 |
| Viewpoint and Perspective | 20 |

These numbers differ from the first row of Table 1 in the paper, and in fact from every row of Table 1. Could you confirm this? Thanks very much!
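For context, here is how I understand the scoring (the helper names and similarity layout below are my own illustration, not the repo's actual code): if each category contains 15 image/text pairs and a pair only counts as correct when both images prefer their own caption, then every category score is a multiple of 100/15 ≈ 6.7, which matches the numbers above.

```python
# Sketch of pair-level scoring as I understand it; helper names and the
# similarity layout are illustrative, not the repo's actual code.
from typing import List, Tuple

Pair = Tuple[Tuple[float, float], Tuple[float, float]]  # sims[i][j]: image i vs. caption j

def pair_correct(sims: Pair) -> bool:
    (s00, s01), (s10, s11) = sims
    # The pair counts only if image 0 prefers caption 0 AND image 1 prefers caption 1.
    return s00 > s01 and s11 > s10

def category_score(pairs: List[Pair]) -> float:
    """Percentage of correctly matched pairs in one visual-pattern category."""
    return 100.0 * sum(pair_correct(p) for p in pairs) / len(pairs)

# With 15 pairs per category, 4 correct pairs gives 100 * 4 / 15 ≈ 26.7,
# so every score lands on a multiple of ~6.7.
```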

lst627 commented 4 months ago

I could not get the same results either. Repeating the same command multiple times also yielded different results for 'Positional and Relational Context', 'Structural and Physical Characteristics', and 'Orientation and Direction', while the other categories stayed the same across runs.
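If the run-to-run variation comes from non-deterministic GPU kernels rather than from the benchmark itself, pinning the seeds and forcing deterministic algorithms before running evaluate_vlm.py is a quick way to check. This is generic PyTorch setup, not something the repo currently does; a minimal sketch:

```python
# Generic reproducibility setup for a PyTorch/CLIP evaluation run; a sketch,
# not part of the MMVP repo. Run this before building the model and data loader.
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by some CUDA ops when deterministic algorithms are enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Raise an error if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)

seed_everything(0)
```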

pavank-apple commented 4 weeks ago

I am seeing a similar issue: I cannot reproduce the results in Table 1 even when running the exact code in the repo. @tsb0601 any advice? Unlike @lst627, I get consistent results across multiple runs, but they are consistently worse than the reported numbers.
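To make the setups easier to compare, it may help if each reproduction attempt comes with an environment report; a minimal sketch using only standard torch/stdlib calls (nothing repo-specific):

```python
# Minimal environment report to attach alongside the scores; a sketch, not part of the repo.
import platform
import sys

import torch

print("python   :", sys.version.split()[0])
print("platform :", platform.platform())
print("torch    :", torch.__version__)
print("cuda     :", torch.version.cuda)
print("cudnn    :", torch.backends.cudnn.version())
print("gpu      :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
```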