open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

Reproducing Qwen2-VL-72B-Instruct Evaluation Results Fails #580

Closed: ChuanyangZheng closed this issue 1 day ago

ChuanyangZheng commented 2 weeks ago

Hi, I cloned your code (commit 0c44cd2845a0fab51580c5654a86c0da96c7d155) and followed the [official Qwen2-VL repo](https://github.com/QwenLM/Qwen2-VL) to run Qwen2-VL-72B-Instruct. However, my evaluation results on the OpenCompass Multi-modal Leaderboard benchmarks differ from the leaderboard's. Since I don't have access to GPT and therefore don't use it as a judge, some results such as HallusionBench are low as expected, but others such as AI2D_TEST are much lower. Here is our environment:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```

Related dependencies:

```
accelerate==0.34.2
transformers==4.45.0.dev0
torch==2.4.0+cu121
torchaudio==2.4.0+cu121
torchvision==0.19.0+cu121
nvidia-cublas-cu12==12.6.3.3
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.1.105
nvidia-nvtx-cu12==12.1.105
```
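For reference, a typical VLMEvalKit launch for this setup would look roughly like the sketch below; the registered model key is an assumption on my side, so check `vlmeval/config.py` for the exact name:

```bash
# Hedged reconstruction of the launch command, not taken verbatim from this issue.
# The model key "Qwen2-VL-72B-Instruct" may differ from the registered name.
python run.py --data AI2D_TEST HallusionBench --model Qwen2-VL-72B-Instruct --verbose
```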

Here are our evaluation results:

[screenshot of evaluation results]

Thanks for your time~

kennymckormick commented 2 weeks ago

Hi, @ChuanyangZheng ,

Actually, for other MCQ benchmarks like MMMU or AI2D_TEST, we also use GPT-3.5-Turbo for choice label extraction, so it is expected that results reproduced without GPT are a little lower.
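To make the effect concrete, here is a minimal sketch of a rule-based fallback extractor; it illustrates why scores drop without a judge, and is not VLMEvalKit's actual matching code:

```python
import re
from typing import Optional

def extract_choice_exact(prediction: str, choices: dict) -> Optional[str]:
    """Return a choice label only when the model reply is unambiguous."""
    pred = prediction.strip()
    # Case 1: the reply is just a bare label, e.g. "B", "(B)", or "B.".
    m = re.fullmatch(r"\(?([A-E])\)?\.?", pred)
    if m:
        return m.group(1)
    # Case 2: exactly one option's text appears verbatim in the reply.
    hits = [label for label, text in choices.items() if text.lower() in pred.lower()]
    if len(hits) == 1:
        return hits[0]
    # Anything else stays unmatched; a GPT judge could still score such replies,
    # so scores computed without the judge are a lower bound.
    return None

choices = {"A": "mitochondrion", "B": "nucleus", "C": "ribosome", "D": "vacuole"}
print(extract_choice_exact("The answer is the nucleus.", choices))              # B
print(extract_choice_exact("Either the nucleus or a ribosome fits.", choices))  # None
```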

ChuanyangZheng commented 2 weeks ago

Could I download the results from your leaderboard evaluation, including the detailed predictions, so I can compare the differences more precisely? The 3.9-point gap on AI2D_TEST seems larger than what GPT-3.5-Turbo choice extraction alone could explain.

kennymckormick commented 2 weeks ago

Sure, I'm going to upload the evaluation records to https://huggingface.co/datasets/VLMEval/OpenVLMRecords/tree/main/mmeval today.
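For anyone else reproducing this, the records can be pulled with `huggingface_hub` once uploaded; a sketch, where the `allow_patterns` filter is an assumption about the file layout:

```python
from huggingface_hub import snapshot_download

# Download only the mmeval records instead of the full dataset repo.
# The "*Qwen2-VL*" filename pattern is an assumption; browse the repo
# tree on the Hub to find the exact paths.
local_dir = snapshot_download(
    repo_id="VLMEval/OpenVLMRecords",
    repo_type="dataset",
    allow_patterns=["mmeval/*Qwen2-VL*"],
)
print("records downloaded to", local_dir)
```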

ChuanyangZheng commented 2 weeks ago

Thanks, we found that all the gaps (including AI2D_TEST) come from evaluating with versus without GPT. Our detailed predictions align with yours.
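For reference, the comparison can be done with a few lines of pandas, assuming both runs dumped per-sample predictions to VLMEvalKit-style .xlsx files; the filenames and column names below are placeholders:

```python
import pandas as pd

# Placeholder filenames; substitute the actual prediction dumps.
ours = pd.read_excel("ours_Qwen2-VL-72B-Instruct_AI2D_TEST.xlsx")
leaderboard = pd.read_excel("records_Qwen2-VL-72B-Instruct_AI2D_TEST.xlsx")

# "index" and "prediction" are the usual VLMEvalKit column names, assumed here.
merged = ours.merge(leaderboard, on="index", suffixes=("_ours", "_lb"))
diff = merged[merged["prediction_ours"] != merged["prediction_lb"]]
print(f"{len(diff)} of {len(merged)} raw predictions differ")
```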

kennymckormick commented 1 day ago

Glad to see the problem is resolved.