haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Cannot reproduce MME results on LLaVA-1.5-7B #630

yix-chen commented 9 months ago

Question

Hi, I cannot reproduce the MME results by following finetune.sh on the 665k instruction-tuning dataset and the MME evaluation scripts. We followed all the settings except flash-attention on A100, and got 1466.6. Given that the paper reports 1510 on MME, is that a normal fluctuation, or do some hyperparameters need to be tweaked?
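
(For context on the scale of the gap: as I understand MME's scoring convention, the perception score sums accuracy and accuracy+ over the ten perception subtasks, each worth at most 200 points, for a maximum of 2000 -- so 1466.6 vs. 1510 is roughly a 3% relative difference. A rough sketch of that aggregation, with placeholder numbers rather than real results:)

```python
# Sketch of MME perception-score aggregation (my understanding of the benchmark's
# convention; subtask names are illustrative and the numbers are placeholders).
# Each subtask contributes accuracy + accuracy+, i.e. at most 200 points,
# so the ten perception subtasks sum to at most 2000.
perception_subtasks = {
    "existence": (95.0, 90.0),  # (accuracy, accuracy+) -- placeholder values
    "count": (80.0, 60.0),
    "position": (70.0, 45.0),
    # ... remaining perception subtasks ...
}

perception_score = sum(acc + acc_plus for acc, acc_plus in perception_subtasks.values())
print(f"MME perception score: {perception_score:.1f}")
```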

haotian-liu commented 9 months ago

Hi, can you share the numbers you get when evaluating the official checkpoints on your local machine (to make sure the eval is consistent)? Also, what about the numbers on other datasets? Are they consistently lower (can you share them as well)? Thanks.

yix-chen commented 9 months ago

Hi Haotian,

The MME evaluation on the official v1.5-7B checkpoint is fine: 1508.9. On other datasets, the results are also consistent with the reported numbers. So I wonder if something went wrong during finetuning, e.g., flash-attention not being used?

haotian-liu commented 9 months ago

Hi @yix-chen

I have not tested running without flash-attention, but theoretically it is an exact-attention optimization, so training with or without it should not significantly affect the results.
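
For what it's worth, here is a minimal sketch illustrating that point -- it does not use the LLaVA codebase or the flash-attn package itself, just PyTorch's built-in SDPA backends -- showing that a FlashAttention-style kernel computes exact attention and differs from the reference "math" path only by floating-point rounding:

```python
import torch
import torch.nn.functional as F

# Random (batch, heads, seq_len, head_dim) tensors in fp16 on GPU,
# which is the regime where the flash kernel is eligible.
q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Force the FlashAttention-style fused kernel.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out_flash = F.scaled_dot_product_attention(q, k, v)

# Force the reference (unfused) math implementation.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    out_math = F.scaled_dot_product_attention(q, k, v)

# The difference should be on the order of fp16 rounding error, not algorithmic.
print((out_flash - out_math).abs().max())
```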

It seems that the eval is fine, but it is still hard to determine the cause from the MME score alone. Can you share the numbers for more datasets you have tested, so that we can see both the trend and the exact absolute differences? Thanks.

Carol-lyh commented 9 months ago

We cannot reproduce the results on MME either; our result is 1457.7.

TempleX98 commented 9 months ago

We also failed to reproduce the official performance. Our model got a score of 1473.

haotian-liu commented 8 months ago

This may be due to some unexpected randomness when using distributed training (https://github.com/haotian-liu/LLaVA/issues/864), though we haven't figured out where the randomness comes from -- the data mixture order is verified to be the same across different runs, and there should not be any randomly initialized weights when starting from a pretrained projector.

This observed randomness leads to fluctuations in some benchmark scores -- MME is the most prominent (I can get +/- 20 around the reported 1510 for the 7B model, and similar for the 13B model), while other datasets are mostly stable.

Any observations or advice regarding the randomness are welcome.
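
For anyone investigating, a generic (not LLaVA-specific) determinism checklist like the sketch below can help localize where run-to-run variation enters; `torch.use_deterministic_algorithms` in particular will flag ops that only have nondeterministic CUDA kernels:

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed common RNGs and request deterministic kernels (trades speed for reproducibility)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Required for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (or error, with warn_only=False) on any op without a deterministic
    # CUDA implementation, which helps pinpoint the source of variation.
    torch.use_deterministic_algorithms(True, warn_only=True)
```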

eehover commented 7 months ago

Try setting the DeepSpeed ZeRO-1 config; the loss will be the same every time.
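
For reference, a sketch of what such a config might look like -- a hypothetical `zero1.json` modeled on the repo's existing ZeRO configs, with the optimization stage lowered to 1 (key names follow standard DeepSpeed / HF Trainer conventions):

```python
import json

# Hypothetical ZeRO stage-1 config; save it alongside the repo's other ZeRO
# configs and pass it to the training script via --deepspeed.
zero1_config = {
    "fp16": {"enabled": "auto"},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 1,  # only optimizer states are partitioned
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero1.json", "w") as f:
    json.dump(zero1_config, f, indent=4)
```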

zjysteven commented 1 month ago

Can confirm the same thing here. I'm using lmms-eval for evaluation. The released llava-1.5-7b checkpoint got 1512 on MME, while my retrained/reproduced version got only 1478.