Carol-lyh opened this issue 1 year ago
Hi @Carol-lyh,
I am facing the same issue. Have you figured it out?
I am also facing the same issue here. Has anyone solved this and managed to match the reported scores?
This may be due to some unexpected randomness when using distributed training (https://github.com/haotian-liu/LLaVA/issues/864), though we haven't figured out where the randomness comes from -- the data mixture order is verified to be the same across different runs, and there should not be any randomly initialized weights if we start from a pretrained projector.
This observed randomness has led to fluctuations in some benchmark scores -- MME is the most prominent (I can get +/- 20 around the reported 1510 for the 7B model, and similar for the 13B model), while the other datasets are mostly stable.
Any observations/advice regarding the randomness are welcome.
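A hedged suggestion for anyone trying to pin this down (not something from the LLaVA scripts themselves): force PyTorch into deterministic mode for a short run and watch which ops get flagged. The sketch below assumes a standard PyTorch setup; `seed_everything` is a hypothetical helper you would call once per process, before the model and dataloader are constructed.

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG we know about and request deterministic kernels."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Needed by cuBLAS when deterministic algorithms are enforced;
    # must be set before the first CUDA call to take effect.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # warn_only=True logs a warning whenever a non-deterministic op runs,
    # which points directly at the kernels that could cause run-to-run variance.
    torch.use_deterministic_algorithms(True, warn_only=True)


if __name__ == "__main__":
    seed_everything(42)
```

Even with all of this, multi-GPU runs can still differ because the summation order in collective reductions (e.g. all-reduce) is not fixed, so identical seeds alone may not make two distributed runs bit-identical.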
@haotian-liu I also cannot reproduce the results on MMBench dev using v1_5/finetune13B.sh.
| dev set | dev_overall | dev_attribute_reasoning | dev_coarse_perception | dev_finegrained_perception (cross-instance) | dev_finegrained_perception (instance-level) | dev_logic_reasoning | dev_relation_reasoning |
|---|---|---|---|---|---|---|---|
| llava1.5-13b (paper) | 68.2 | 67.3 | 82.1 | 59.4 | 72 | 44.1 | 60 |
| llava1.5-13b (ours) | 67.26 | 69.65 | 79.53 | 58.62 | 71.38 | 39.16 | 60.869 |
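To make the comparison easier to read, here is a small sketch (values copied directly from the table above) that prints the per-category gap; it shows that the largest drops are in logical reasoning and coarse perception, while attribute reasoning actually comes out higher than the paper.

```python
# Values copied from the MMBench-dev table above (paper vs. this reproduction).
paper = {
    "overall": 68.2, "attribute_reasoning": 67.3, "coarse_perception": 82.1,
    "finegrained_cross_instance": 59.4, "finegrained_instance_level": 72.0,
    "logic_reasoning": 44.1, "relation_reasoning": 60.0,
}
ours = {
    "overall": 67.26, "attribute_reasoning": 69.65, "coarse_perception": 79.53,
    "finegrained_cross_instance": 58.62, "finegrained_instance_level": 71.38,
    "logic_reasoning": 39.16, "relation_reasoning": 60.869,
}

# Positive delta = reproduction above the paper number, negative = below.
for key in paper:
    delta = ours[key] - paper[key]
    print(f"{key:30s} {delta:+.2f}")
```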
Hi @Carol-lyh, I also ran finetune.sh with the 665k instruction dataset on 7B, but I have problems reproducing the results on GQA, TextVQA, and MME. My results are 58.2, 57.5, and 1476.2, respectively. Just to check: how did you run the experiment? Did you just execute finetune.sh?
Hi @Carol-lyh, have you tested MM-Vet? I used VLMEvalKit, and my MM-Vet results are much lower than those listed in VLMEvalKit.
Hi @yuangpeng, may I ask how you obtained the results for MMBench? The evaluation instructions suggest submitting the generated results to the evaluation server https://rank.opencompass.org.cn/leaderboard-multimodal, but I couldn't find any submission guidance on the leaderboard page.
I see you submitted results to https://mmbench.opencompass.org.cn/mmbench-submission in your dreamllm project. However, that server seems to use another version of the dev set, since I see log messages such as "Index 1222 in your result do not exist in the released data file, thus ignored. Please use our latest released data file."
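In case it helps others hitting the same warning: one way to check which release of the dev file your predictions were generated from is to diff the index columns directly. This is only a sketch; the file names below are placeholders, and I am assuming both files are TSVs with an `index` column as in the released MMBench data.

```python
import pandas as pd

# Placeholder paths -- substitute your own prediction file and the
# released MMBench dev TSV you intend to submit against.
preds = pd.read_csv("my_mmbench_dev_predictions.tsv", sep="\t")
released = pd.read_csv("mmbench_dev_released.tsv", sep="\t")

pred_ids = set(preds["index"])
released_ids = set(released["index"])

# Indices present in the predictions but missing from the released file are
# exactly the ones the server reports as "do not exist ... thus ignored".
print("only in predictions:", sorted(pred_ids - released_ids)[:10])
print("only in released file:", sorted(released_ids - pred_ids)[:10])
print(f"{len(pred_ids & released_ids)} / {len(released_ids)} indices match")
```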
Sorry for the long delay in replying. I am currently using https://github.com/open-compass/VLMEvalKit for evaluation.
Question
I cannot reproduce the MME results after following finetune.sh on the 665k instruction-tuning dataset and the evaluation scripts for MME. We followed all the settings but get 1457.7, which is a large gap from the 1510 reported in the paper. However, the evaluation results on the other datasets seem reasonable (except that the result on ScienceQA is much higher).
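For context, here is a quick sanity check of this gap against the ±20 MME fluctuation quoted earlier in the thread (a sketch; the numbers are simply copied from the messages above):

```python
# Numbers copied from this thread: the reported MME score, our reproduced
# score, and the +/-20 run-to-run fluctuation mentioned for the 7B model.
reported, reproduced, band = 1510.0, 1457.7, 20.0

gap = reported - reproduced
print(f"absolute gap: {gap:.1f}")              # 52.3
print(f"relative gap: {gap / reported:.1%}")   # ~3.5%
print("within" if gap <= band else "outside", "the +/-20 fluctuation band")
```

So the drop looks larger than the reported run-to-run fluctuation alone would explain.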
Here are the results: