open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

The result of gpt4o on MathVista is 61.6, which is lower than the 62.7 on the list. #553

Open · kkk123698745 opened this issue 1 week ago

kkk123698745 commented 1 week ago

After reproducing GPT-4o on MathVista, we found the result is 61.6, which is lower than the 62.7 on the list. Is there something I missed to get that score? We also tested GPT-4o with the prompt ending in 'solution:' and got a score of 65.3%.

kennymckormick commented 6 days ago

Hi, @kkk123698745, I will rerun the evaluation and post the results here (fact check: are you using gpt-4o-20240806?). By the way, we are happy to update the evaluation result if you can create a PR to integrate your better evaluation setting.

kkk123698745 commented 5 days ago

@kennymckormick Thanks for your reply; looking forward to your rerun results. Yes, we used gpt-4o-20240806. Indeed, we simply ran your code and got a score of 61.6% without any modification to the prompt, and got a score of 65.3% after appending '\nsolution:' to the tail of the prompt. The full prompt looks like this: 'Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end. Question: Is the number of purple metallic things that are behind the small green motorbike less than the number of blue metal articulated buss? Choices: (A) Yes (B) No solution:'
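For clarity, here is a minimal sketch of the two prompt variants being compared. The function and argument names are illustrative only, not actual VLMEvalKit code; they just mirror the MathVista-style hint + question + choices layout quoted above:

```python
# Illustrative sketch of the two prompt variants compared above.
# Hypothetical helper, not VLMEvalKit's actual prompt-building code.

def build_mathvista_prompt(question: str, choices: list[str],
                           add_solution_suffix: bool = False) -> str:
    hint = ("Hint: Please answer the question and provide the correct option "
            "letter, e.g., A, B, C, D, at the end.")
    choice_str = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = f"{hint} Question: {question} Choices: {choice_str}"
    if add_solution_suffix:
        # The variant reported as 65.3% in the experiment above.
        prompt += "\nsolution:"
    return prompt
```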

kennymckormick commented 2 days ago

Hi, @kkk123698745, I have performed two checks:

  1. Re-scoring the old predictions (computing accuracy from the previously generated predictions): the accuracy is almost the same. We got 62.6%, while our reported number is 62.7%. You can reproduce this with our released prediction files here: https://huggingface.co/datasets/VLMEval/OpenVLMRecords (a rough re-scoring sketch follows the screenshots below).

Screenshot: (image attached)

  2. Re-running the entire evaluation process: this time there is a larger difference. I only got 60.9% accuracy, which is about 2% lower than our reported result.

Screenshot: (image attached)
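For the first check, recomputing accuracy from a released record might look roughly like the sketch below. The file path and the per-question 'hit' column are assumptions, so please check the actual file layout in VLMEval/OpenVLMRecords first:

```python
# Rough sketch for re-scoring from a released prediction record.
# The filename is hypothetical and the 'hit' column is an assumption;
# inspect the dataset at VLMEval/OpenVLMRecords for the real layout.
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="VLMEval/OpenVLMRecords",
    repo_type="dataset",
    filename="GPT4o/GPT4o_MathVista_MINI_score.xlsx",  # hypothetical path
)
scores = pd.read_excel(path)
accuracy = scores["hit"].mean() * 100  # assumes a binary per-question hit flag
print(f"MathVista accuracy: {accuracy:.1f}%")
```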

I think the most likely reason for this difference is that, since GPT-4o is a proprietary model (a black box to us), OpenAI may have modified the model behind the API, so the results are now different. For API models, our evaluation results only stand for the performance at the corresponding timestamp.