open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

HallusionBench skips samples without images #471

Open · opened by ChuanyangZheng 3 days ago

ChuanyangZheng commented 3 days ago

Hello, I followed your Evaluation guide and evaluated HallusionBench with the following script:

python run.py --data HallusionBench --model model --verbose --nproc=1

The model generated 951 results, skipping the samples without images. However, the HallusionBench benchmark should evaluate 1,129 samples, including some pure-text samples.

I found that the possible cause lies in the skip_noimg setting: running with skip_noimg=False raises an error.

Am I missing something? Thanks for your time!
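For context, here is a minimal sketch (not VLMEvalKit's actual code) of how a skip_noimg-style filter would drop text-only rows from a benchmark TSV; the file name and the "image" column holding base64 data are assumptions for illustration:

```python
# Sketch of a skip_noimg-style filter: rows without an image payload are dropped,
# which would shrink the evaluated set from 1,129 to the image-bearing subset.
import pandas as pd

def load_samples(tsv_path: str, skip_noimg: bool = True) -> pd.DataFrame:
    data = pd.read_csv(tsv_path, sep="\t")
    if skip_noimg:
        # Keep only rows that actually carry an image; pure-text samples
        # (empty or missing "image" field) are filtered out.
        data = data[~pd.isna(data["image"]) & (data["image"] != "")]
    return data

# Hypothetical usage: "HallusionBench.tsv" is a placeholder path.
samples = load_samples("HallusionBench.tsv", skip_noimg=True)
print(len(samples))  # fewer rows than 1,129 if text-only samples exist
```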

kennymckormick commented 3 days ago

Hi @ChuanyangZheng, in VLMEvalKit we do not include the pure-text samples, since they are not related to multi-modal evaluation.

ChuanyangZheng commented 2 days ago

Thanks for your reply. I also want to ask: is MMBench V1.1 on the OpenCompass leaderboard evaluated as follows?

python run.py --data MMBench_V11 --model model --verbose --nproc=1

But MMBench_V11 is an internal-only file; we can only access MMBench_TEST_EN_V11 and MMBench_TEST_CN_V11. So can we use the following commands

python run.py --data MMBench_TEST_EN_V11 --model model --verbose --nproc=1
python run.py --data MMBench_TEST_CN_V11 --model model --verbose --nproc=1

and average these two results to get the result on the leaderboard?
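(For reference, a minimal sketch, not part of VLMEvalKit, of running both splits back-to-back from Python; the model name "model" is the same placeholder used in the commands above.)

```python
# Run the EN and CN splits of MMBench V1.1 sequentially via the run.py CLI.
import subprocess

for dataset in ["MMBench_TEST_EN_V11", "MMBench_TEST_CN_V11"]:
    subprocess.run(
        ["python", "run.py", "--data", dataset,
         "--model", "model", "--verbose", "--nproc=1"],
        check=True,  # stop if either evaluation run fails
    )
```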

kennymckormick commented 2 days ago

@ChuanyangZheng

Yeah, and you need to submit the prediction file to https://mmbench.opencompass.org.cn/mmbench-submission to obtain the metrics.
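(If averaging is the intended combination, a minimal sketch of the arithmetic, assuming the two overall accuracies have already been returned by the submission server; the values below are placeholders, not real results.)

```python
# Placeholder overall accuracies from the MMBench submission server.
en_overall = 0.0  # MMBench_TEST_EN_V11 overall accuracy (placeholder)
cn_overall = 0.0  # MMBench_TEST_CN_V11 overall accuracy (placeholder)

# Leaderboard-style score as the mean of the two splits.
leaderboard_score = (en_overall + cn_overall) / 2
print(f"MMBench V1.1 (EN+CN average): {leaderboard_score:.2%}")
```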