haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Cannot reproduce MME results on LLaVA-1.5-7B #680

Open Carol-lyh opened 1 year ago

Carol-lyh commented 1 year ago

Question

I cannot reproduce the MME result after fine-tuning with finetune.sh on the 665k instruction tuning dataset and running the evaluation scripts for MME. We followed all the settings but got 1457.7, which is a large gap from the 1510.7 reported in the paper. However, the evaluation results on the other datasets look reasonable (except that the ScienceQA result is much higher).

Here are the results:

| exp | GQA | ScienceQA | TextVQA | POPE | MME |
| --- | --- | --- | --- | --- | --- |
| paper | 62.0 | 66.8 | 58.2 | 85.9 | 1510.7 |
| ours | 62.6 | 70.8 | 58.3 | 85.8 | 1457.7 |
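
For context, here is a rough sketch of how MME aggregates its score, based on my understanding of the benchmark (this is not code from the LLaVA repo or from this thread): each perception subtask is scored as accuracy plus accuracy+, where accuracy+ counts an image only if both of its yes/no questions are answered correctly, so each subtask contributes up to 200 points and the reported number is the sum over the perception subtasks.

```python
# Hypothetical sketch of MME-style scoring; data layout and names are illustrative.

def subtask_score(results: dict[str, list[bool]]) -> float:
    """results maps image_id -> per-question correctness flags (two yes/no questions per image)."""
    flags = [f for qs in results.values() for f in qs]
    acc = 100.0 * sum(flags) / len(flags)                                  # per-question accuracy
    acc_plus = 100.0 * sum(all(qs) for qs in results.values()) / len(results)  # per-image "both correct"
    return acc + acc_plus                                                  # max 200 per subtask


def mme_perception_total(per_subtask: dict[str, dict[str, list[bool]]]) -> float:
    """Sum the subtask scores over all perception subtasks."""
    return sum(subtask_score(r) for r in per_subtask.values())


if __name__ == "__main__":
    toy = {"existence": {"img_1": [True, True], "img_2": [True, False]}}
    print(mme_perception_total(toy))  # 75.0 + 50.0 = 125.0
```

Because a single flipped yes/no answer in a small subtask can move that subtask by several points (it shifts accuracy and may also break an accuracy+ pair), a handful of borderline answers across subtasks is enough to move the total by tens of points, which makes a ~50-point gap less surprising than it looks.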
yix-chen commented 1 year ago

Hi @Carol-lyh,

I am facing the same issue. Have you figured it out?

becxer commented 11 months ago

I am also facing the same issue here. Has anyone managed to match the reported score?

haotian-liu commented 11 months ago

This may be due to some unexpected randomness when using distributed training (https://github.com/haotian-liu/LLaVA/issues/864). We haven't figured out where the randomness comes from -- the data mixture order is verified to be the same across different runs, and there should not be any randomly initialized weights if we start from a pretrained projector.

This observed randomness leads to fluctuations on some benchmarks -- MME is the most prominent (I can get +/- 20 around the reported 1510 for the 7B model, and similarly for the 13B model), while the other datasets are mostly stable.

Any observations/advice regarding the randomness are welcome.
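
One generic thing to rule out when chasing this kind of nondeterminism (these are standard PyTorch knobs, not something the LLaVA training script is confirmed to omit) is unpinned RNG state and non-deterministic CUDA kernels; a minimal sketch:

```python
# Hypothetical reproducibility checklist for a distributed PyTorch fine-tuning run.
# Standard PyTorch/NumPy settings, not code taken from the LLaVA repo.
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Pin every RNG we control and request deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # cuBLAS needs this env var for deterministic matmuls (CUDA >= 10.2).
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (rather than error) when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)


if __name__ == "__main__":
    seed_everything(42)
```

Even with all seeds pinned, data-loader worker scheduling and the floating-point reduction order across GPUs can still introduce small run-to-run differences, which would be consistent with the +/- 20 MME fluctuation described above.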

shipengai commented 11 months ago

@haotian-liu I also cannot reproduce the MMBench dev results using v1_5/finetune13B.sh.

MMBench dev set results:

| exp | dev_overall | dev_attribute_reasoning | dev_coarse_perception | dev_finegrained_perception (cross-instance) | dev_finegrained_perception (instance-level) | dev_logic_reasoning | dev_relation_reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llava1.5-13b (paper) | 68.2 | 67.3 | 82.1 | 59.4 | 72 | 44.1 | 60 |
| llava1.5-13b (ours) | 67.26 | 69.65 | 79.53 | 58.62 | 71.38 | 39.16 | 60.869 |

cathyxl commented 11 months ago

Hi @Carol-lyh, I also ran finetune.sh with the 665k instruction dataset on the 7B model, but I have problems reproducing the results on GQA, TextVQA, and MME. My results are 58.2, 57.5, and 1476.2. Just to check: how did you run the experiment? Is it just by executing finetune.sh?

yuangpeng commented 6 months ago

Hi @Carol-lyh, have you tested MM-Vet? I evaluated with VLMEvalKit, and my MM-Vet result is much lower than the one reported in VLMEvalKit.

BaohaoLiao commented 6 months ago

Hi @yuangpeng, may I ask how you obtained the result for MMBench? The instructions suggest submitting the generated results to the evaluation server at https://rank.opencompass.org.cn/leaderboard-multimodal, but I couldn't find any submission guidance on the leaderboard page.

I see you submitted results to https://mmbench.opencompass.org.cn/mmbench-submission in your dreamllm project. However, that server seems to use another version of the dev set, since I get log messages like "Index 1222 in your result do not exist in the released data file, thus ignored. Please use our latest released data file."

yuangpeng commented 6 months ago

> Hi @yuangpeng, may I ask how you obtained the result for MMBench? The instructions suggest submitting the generated results to the evaluation server at https://rank.opencompass.org.cn/leaderboard-multimodal, but I couldn't find any submission guidance on the leaderboard page.
>
> I see you submitted results to https://mmbench.opencompass.org.cn/mmbench-submission in your dreamllm project. However, that server seems to use another version of the dev set, since I get log messages like "Index 1222 in your result do not exist in the released data file, thus ignored. Please use our latest released data file."

Sorry for the long delay in replying. I am currently using https://github.com/open-compass/VLMEvalKit for evaluation.