Hi @justinphan3110,
TL;DR: the GPT4 score reported by the original LLaVA project corresponds to the Relative Score reported in VLMEvalKit.
Why: the two names refer to different things. In the LLaVA project, "GPT4 score" means the score was evaluated by GPT-4; in VLMEvalKit, "GPT4 Score" means the score assigned to the reference answer (which was written by GPT-4).
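For concreteness, here is a minimal sketch (not VLMEvalKit's actual implementation) of how a relative score of this kind is typically derived: the judge rates both the candidate model's answer and the GPT-4 reference answer, and the relative score is the ratio of the two totals in percent. The `relative_score` helper below is hypothetical.

```python
# Minimal sketch, not VLMEvalKit's actual code: an LLaVABench-style relative
# score is the candidate model's judge ratings expressed as a percentage of
# the GPT-4 reference answers' ratings.
def relative_score(vlm_judge_ratings, gpt4_judge_ratings):
    """Hypothetical helper: ratio of total judge ratings, in percent."""
    return 100.0 * sum(vlm_judge_ratings) / sum(gpt4_judge_ratings)

# Sanity check against the overall row of the table below:
# 100 * 52.2 / 78.3 ≈ 66.7, which matches the reported Relative Score of 66.6
# up to rounding.
print(round(100 * 52.2 / 78.3, 1))  # 66.7
```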
Oh, so to clarify: for example, in this LLaVA-1.6 table, the LLaVA authors show that LLaVA-1.6-Mistral-7B got 83 on LLaVABench, but this is the score I got from running VLMEvalKit:
| split | Relative Score (main) | VLM Score | GPT4 Score |
|---|---|---|---|
| overall | 66.6 | 52.2 | 78.3 |
| conv | 57.9 | 49.4 | 85.3 |
| complex | 76.2 | 58.2 | 76.4 |
| detail | 59.5 | 44.0 | 74.0 |
This is the command that I used:

```python
model = "llava_next_mistral_7b"
task = "LLaVABench"
!torchrun --nproc-per-node={num_gpus} run.py --data {task} --model {model} --nproc 20
```
Is this big gap in scores (66.6 and 83) expected?
Did the authors mention which version of GPT-4 they used for evaluating LLaVA 1.6? The original LLaVABench adopts GPT-4-0314 (which is no longer available to new users), while VLMEvalKit adopts GPT-4-1106. Different versions of GPT-4 may lead to significant changes in the final score.
I see, changing to GPT-4-0613 also brought the score back to ~80.0. Thanks for the clarification.
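As a side note, here is a minimal illustration (not VLMEvalKit's internals) of what pinning the judge version amounts to: the same judging prompt is sent to a specific GPT-4 snapshot via the OpenAI API. The prompt text below is a placeholder.

```python
# Illustrative only, not VLMEvalKit's code: pinning the GPT-4 judge snapshot
# when scoring answers, since 0314 / 0613 / 1106 can rate the same answers
# noticeably differently.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4-0613",  # judge snapshot; gpt-4-0314 is no longer open to new users
    messages=[{"role": "user", "content": "<LLaVABench judging prompt goes here>"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```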
Hi author,
I saw that there are 3 columns for LLaVABench scores (Relative Score, VLM Score, GPT4 Score), and it seems like in the evaluation code the Relative Score is calculated from the VLM Score and GPT4 Score. However, it seems like the original LLaVA project only reports the GPT4 Score (this table). So how are the VLM Score and Relative Score calculated and reported? Or am I missing something?