AI2D: official GPT and Claude 3.5 scores are very high

open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks

https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

Apache License 2.0


AI2D: official GPT and Claude 3.5 scores are very high #577

Open Violettttee opened 2 weeks ago

Violettttee commented 2 weeks ago

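Hello! Do you have any suggestions or thoughts on the exceptionally high AI2D scores that OpenAI and Anthropic (Claude 3.5) report? I evaluated GPT several times, varying the setup and the prompt (including adding CoT), but could not reproduce the reported score of 0.942. The best I got, with CoT, was only 0.83. What do you think explains this gap? I also noticed that none of the AI2D scores in your own evaluations exceed 0.9, so I am curious how Claude and GPT were measured at near-perfect scores.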

kennymckormick commented 2 weeks ago

Hi @Violettttee, you can try the AI2D_TEST_NO_MASK dataset we provide, which generally yields better performance than AI2D_TEST because of its different setting. However, we still cannot reproduce the numbers reported by OpenAI or Anthropic.
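
For anyone who wants to compare the two settings side by side, here is a minimal sketch that assembles a VLMEvalKit `run.py` invocation covering both dataset names. The helper `build_eval_command` is hypothetical (not part of the toolkit), and `"GPT4o"` is only a placeholder model name; substitute whichever model key your local VLMEvalKit config defines. The sketch only builds and prints the command, it does not run an evaluation.

```python
import shlex


def build_eval_command(model, datasets, verbose=True):
    """Build a `run.py` argument list for one model and several datasets.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    if not datasets:
        raise ValueError("at least one dataset name is required")
    # run.py takes dataset names after --data and model names after --model.
    cmd = ["python", "run.py", "--data", *datasets, "--model", model]
    if verbose:
        cmd.append("--verbose")
    return cmd


# "GPT4o" is an assumed placeholder model name, not confirmed in this thread.
cmd = build_eval_command("GPT4o", ["AI2D_TEST", "AI2D_TEST_NO_MASK"])
print(shlex.join(cmd))
```

Running the printed command from the repository root should evaluate the same model on both variants, so the difference attributable to the masking setting can be read off directly.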