open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

Cannot Reproduce Results of LLaVA-1.5 on MMBench_DEV_CN #404

Closed · ppalantir closed this issue 1 month ago

ppalantir commented 2 months ago

Thank you for your awesome work!

I followed the README.md and used the command `python run.py --data MMBench_DEV_CN --model llava_v1.5_13b --verbose`, and found that the overall accuracy is much lower than LLaVA-1.5's reported result (MMBench-CN 63.6).

[screenshot: measured MMBench_DEV_CN evaluation results]
FangXinyu-0913 commented 2 months ago

The overall accuracy reported for LLaVA-1.5 is based on the MMBench_CN dataset, which aggregates the TEST set and the DEV set. In my existing environment, the overall accuracy I measured on MMBench_CN matches the official number, and I also tested the accuracy on MMBench_DEV_CN as shown below. The low score you measured is probably related to your environment. The relevant part of my environment is also shown below; you can refer to it, upgrade, and re-test.

[screenshots: MMBench_CN / MMBench_DEV_CN results and environment package versions]
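For anyone comparing environments, a minimal sketch like the one below prints the installed versions of packages that typically matter for LLaVA inference (the package list is an assumption, not an official VLMEvalKit requirement set):

```python
# Minimal sketch for comparing environments: print installed versions of
# packages that commonly affect LLaVA inference results.
# The package list is an assumption, not an official requirement set.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "torchvision", "transformers", "tokenizers", "accelerate", "vlmeval"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```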

ppalantir commented 2 months ago

@FangXinyu-0913 Thank you very much for your reply. But your overall result of 54.72 is also lower than the reported result (MMBench-CN 63.6) by a large margin. Do you have any idea about possible reasons?
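As a rough sanity check on the aggregation point, the sketch below estimates what TEST-set accuracy would be needed for a DEV+TEST aggregate of 63.6 given a DEV accuracy of 54.72; the split sizes used here are hypothetical placeholders, not official counts:

```python
# Back-of-envelope sketch: what TEST accuracy would be needed for the
# DEV+TEST aggregate to reach the reported 63.6, given DEV accuracy 54.72?
# The split sizes below are hypothetical placeholders, not official counts.
n_dev, n_test = 4300, 6700           # hypothetical question counts for DEV / TEST
acc_dev, acc_overall = 54.72, 63.6   # measured DEV accuracy vs. reported aggregate

acc_test_needed = (acc_overall * (n_dev + n_test) - acc_dev * n_dev) / n_test
print(f"TEST accuracy needed: {acc_test_needed:.1f}")  # ~69.3 under these assumptions
```

Under these assumptions the TEST split would need to score around 69, an unusually large DEV/TEST gap, so aggregation alone seems unlikely to explain the difference.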

PhoenixZ810 commented 2 months ago

Hi,

Please refer to issue #439 to see if you are experiencing the same problem.

Best regards.

kennymckormick commented 1 month ago

Hi @ppalantir, we reproduced the problem and also found that the previous results of llava_v1.5 on some Chinese benchmarks (MMBench-CN, CCBench, etc.) can no longer be reproduced. We have asked the authors of LLaVA but still cannot figure out the reason, so for now we have updated the leaderboard to align with the current evaluation results.