Hi @ChuanyangZheng, in VLMEvalKit we do not include the pure-text samples, because they are not related to multi-modal evaluation.
Thanks for your reply. I also want to ask: is MMBench V1.1 on the OpenCompass leaderboard evaluated with the following command?
python run.py --data MMBench_V11 --model model --verbose --nproc=1
But MMBench_V11 is an internal-only file. We can only access MMBench_TEST_EN_V11 and MMBench_TEST_CN_V11. So can we use the following commands
python run.py --data MMBench_TEST_EN_V11 --model model --verbose --nproc=1
python run.py --data MMBench_TEST_CN_V11 --model model --verbose --nproc=1
and average the two results to get the score reported on the leaderboard?
@ChuanyangZheng
Yeah, and you need to submit the prediction file to https://mmbench.opencompass.org.cn/mmbench-submission to obtain the metrics.
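For later readers, a minimal sketch of the averaging step, assuming the leaderboard number is the plain (unweighted) mean of the EN and CN V1.1 test accuracies; the values below are placeholders, not real scores:

```python
# Minimal sketch of combining the two submission results into one V1.1 score.
# Assumption: the leaderboard reports the unweighted mean of EN and CN accuracies.
en_acc = 0.801  # hypothetical overall accuracy returned for MMBench_TEST_EN_V11
cn_acc = 0.786  # hypothetical overall accuracy returned for MMBench_TEST_CN_V11

mmbench_v11 = (en_acc + cn_acc) / 2
print(f"MMBench V1.1 (EN/CN average): {mmbench_v11:.3f}")
```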
Hello, I followed your evaluation instructions and evaluated HallusionBench with the following command:
python run.py --data HallusionBench --model model --verbose --nproc=1
The model generated 951 results because the samples without images were skipped. HallusionBench should evaluate 1,129 samples, including some pure-text samples. I found that the likely cause is the skip setting: running with skip_noimg=False raises an error.
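For reference, a rough way to check how many samples lack an image in the local TSV is sketched below; the path and column name are assumptions based on the usual VLMEvalKit layout (~/LMUData/<dataset>.tsv with a base64 image column), so adjust them if your setup differs:

```python
import os
import pandas as pd

# Rough sanity check of the HallusionBench sample counts.
# Assumption: the dataset TSV sits under ~/LMUData/ and stores images in a
# base64 `image` column, which is empty/NaN for the pure-text samples.
tsv_path = os.path.expanduser("~/LMUData/HallusionBench.tsv")
df = pd.read_csv(tsv_path, sep="\t")

total = len(df)
no_image = int(df["image"].isna().sum()) if "image" in df.columns else None

print(f"total samples: {total}")
print(f"samples without an image (skipped when skip_noimg=True): {no_image}")
```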
Am I missing something? Thanks for your time!