ChuanyangZheng opened this issue 1 week ago
Hi @ChuanyangZheng,
Actually, for other MCQ benchmarks like MMMU or AI2D_TEST, we also use GPT-3.5-Turbo for choice label extraction, so it's expected that results reproduced without GPT are a little lower.
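For context, here is a minimal sketch of why extraction without a GPT judge loses points on verbose answers. This is a simplified rule-based matcher for illustration, not VLMEvalKit's actual implementation:

```python
import re

def extract_choice(prediction: str, choices: dict[str, str]) -> str | None:
    """Try to recover a choice label ('A'-'D') from a free-form prediction."""
    # 1. Look for a bare label such as "B" or "(B)".
    match = re.search(r"\b([A-D])\b", prediction)
    if match:
        return match.group(1)
    # 2. Fall back to matching the full option text.
    for label, text in choices.items():
        if text.lower() in prediction.lower():
            return label
    return None  # extraction failed -> typically scored as incorrect without a GPT judge

choices = {"A": "photosynthesis", "B": "respiration", "C": "digestion", "D": "osmosis"}
print(extract_choice("The answer is B.", choices))                  # 'B'
print(extract_choice("It depicts how plants make food.", choices))  # None
```

The second prediction is the failure mode: a GPT-3.5-Turbo judge can still map "how plants make food" to option A, while the rule-based fallback returns nothing and the item is marked wrong.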
Could I download the results from your leaderboard evaluation, including the detailed predictions? Then I can compare the differences more precisely, since the 3.9-point performance gap on AI2D_TEST may exceed the range that GPT-3.5-Turbo extraction alone could explain.
Sure, I'm going to upload the evaluation records to https://huggingface.co/datasets/VLMEval/OpenVLMRecords/tree/main/mmeval today.
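Once they are up, something like the following should fetch them locally (this assumes the records sit under `mmeval/` in that dataset repo, as the link suggests):

```python
from huggingface_hub import snapshot_download

# Download only the evaluation records from the dataset repo.
local_dir = snapshot_download(
    repo_id="VLMEval/OpenVLMRecords",
    repo_type="dataset",
    allow_patterns="mmeval/*",
)
print(local_dir)
```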
Thanks, we found that all the gaps (including AI2D_TEST) come from evaluating with vs. without GPT. Our detailed predictions align with yours.
Hi, I cloned your code (commit 0c44cd2845a0fab51580c5654a86c0da96c7d155) and followed the [Official Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) instructions to run Qwen2-VL-72B-Instruct; however, my evaluation results on the OpenCompass Multi-modal Leaderboard benchmarks differ from the leaderboard's. Since I don't use GPT (I have no access to it), some results such as HallusionBench are low as expected, but other results such as AI2D_TEST are much lower. Here is our environment:
```
accelerate==0.34.2
transformers==4.45.0.dev0
torch==2.4.0+cu121
torchaudio==2.4.0+cu121
torchvision==0.19.0+cu121
nvidia-cublas-cu12==12.6.3.3
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.1.105
nvidia-nvtx-cu12==12.1.105
```
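For reference, generation roughly follows the usage example in the official Qwen2-VL README; the image path and prompt below are placeholders, and VLMEvalKit's internal wrapper may pass slightly different arguments:

```python
# Requires: pip install qwen-vl-utils (per the official Qwen2-VL README)
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/question_image.png"},  # placeholder path
        {"type": "text", "text": "Which option is correct? A. ... B. ..."},  # placeholder prompt
    ],
}]

# Build the chat-formatted prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens from the output.
generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])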
Here are our evaluation results:
Thanks for your time~