open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

Question about LlavaBench Evaluation #194

Closed justinphan3110 closed 4 months ago

justinphan3110 commented 4 months ago

Hi author,

I saw that there are 3 columns for LLaVABench scores (Relative Score, VLM Score, GPT4 Score), and it looks like the evaluation code computes the Relative Score from the VLM Score and the GPT4 Score. However, the original LLaVA project seems to report only the GPT4 Score (this table). So how are the VLM Score and Relative Score calculated and reported? Or am I missing something?

kennymckormick commented 4 months ago

Hi, @justinphan3110 ,

TL;DR: The GPT4 score reported by the original LLaVA project is the Relative Score reported in VLMEvalKit.

Why: The context is different. In the LLaVA project, "GPT4 score" means the score is assigned by GPT-4 acting as the judge; in VLMEvalKit, "GPT4 Score" means the judge's score for the reference answer (which was written by GPT-4).
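
For anyone mapping the columns onto each other, here is a minimal sketch (not VLMEvalKit's actual code) of how the three columns relate, assuming the usual LLaVA-Bench convention that the relative score is the VLM's judge-assigned score divided by the reference answer's judge-assigned score:

```python
# Minimal sketch, not VLMEvalKit's actual implementation.
# Assumes the usual LLaVA-Bench convention:
#   Relative Score = VLM Score / GPT4 (reference) Score * 100
def relative_score(vlm_score: float, gpt4_ref_score: float) -> float:
    """Ratio of the VLM's judge-assigned score to the reference answer's score."""
    return vlm_score / gpt4_ref_score * 100

# The overall numbers quoted in the next comment are consistent with this:
# 52.2 / 78.3 * 100 ~= 66.7, matching the reported Relative Score of 66.6.
print(round(relative_score(52.2, 78.3), 1))
```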

justinphan3110 commented 4 months ago

Oh, so to clarify: for example, in this llava-1.6 table the LLaVA authors show that LLaVA-1.6-Mistral-7B got 83 on LLaVABench, but this is the score I got from running VLMEvalKit:

 split  Relative Score (main)  VLM Score  GPT4 Score
0  overall                   66.6       52.2        78.3
1     conv                   57.9       49.4        85.3
2  complex                   76.2       58.2        76.4
3   detail                   59.5       44.0        74.0

This is the cmd that I used:

model="llava_next_mistral_7b"
task="LLaVABench"

!torchrun --nproc-per-node={num_gpus} run.py --data {task} --model {model} --nproc 20

Is this big gap in scores (66.6 and 83) expected?

kennymckormick commented 4 months ago

Did the authors mention which version of GPT-4 they used for evaluating LLaVA 1.6? The original LLaVABench adopts GPT-4-0314 (which is no longer available to new users), while VLMEvalKit adopts GPT-4-1106. Different versions of the GPT-4 judge can lead to significant differences in the final score.
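
To make the dependency concrete, here is a hypothetical sketch (not VLMEvalKit's code; the prompt and function are illustrative only) of a GPT-4-judged comparison. It shows that the judge version is just the model string passed to the API call, so swapping gpt-4-0314 / gpt-4-0613 / gpt-4-1106-preview can shift the 1-10 scores and therefore the relative score:

```python
# Hypothetical judge sketch, for illustration only (not VLMEvalKit's code).
# Requires `pip install openai` and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge_pair(question: str, reference: str, candidate: str,
               judge_model: str = "gpt-4-0613") -> str:
    """Ask the chosen GPT-4 snapshot to rate the reference and candidate answers."""
    prompt = (
        f"Question: {question}\n"
        f"Assistant 1 (reference answer): {reference}\n"
        f"Assistant 2 (candidate answer): {candidate}\n"
        "Rate each assistant on a scale of 1 to 10. "
        "Output the two scores on the first line, separated by a space."
    )
    resp = client.chat.completions.create(
        # Swapping this snapshot changes both scores and hence their ratio.
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content
```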

justinphan3110 commented 4 months ago

I see, switching the judge to GPT-4-0613 also brought the score back to ~80.0. Thanks for the clarification.