tsb0601 / MMVP


The accuracy of LLaVA-1.5-7b with CLIP encoder is 60.0 on MMVP #8

Closed Richar-Du closed 8 months ago

Richar-Du commented 8 months ago

I evaluated LLaVA-1.5-7b on the MMVP dataset and found that its accuracy is 60.0%, which is significantly higher than the 24.7% reported in Table 3. Upon comparing the evaluation code, I discovered that the prompt used in https://github.com/tsb0601/MMVP/blob/763500597e65c3446f09047837ceda76f4e264bf/scripts/evaluate_mllm.py#L59 differs from the one used by LLaVA-1.5, which is: '{question}\nA. {}\nB. {}\nAnswer with the option's letter from the given choices directly.' With the first prompt, the model generates a long sentence as its answer, whereas with the second prompt it outputs the option letter directly. This difference in prompts accounts for the large discrepancy in accuracy.

In any case, each question is a binary choice, so a random guess would already reach 50% accuracy; an accuracy of 60% therefore still points to a significant problem with MLLMs :)
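For reference, here is a minimal sketch of the two prompt styles being compared. The question and options are invented, and the MMVP-style string is only an approximation of what line 59 of evaluate_mllm.py builds (options kept inline as "(a)"/"(b)", as suggested by the replace() tweak later in this thread), not a verbatim copy of the repo's code:

```python
# Illustrative only: the question/options below are made up.
question = "Is the butterfly's wing open or closed?"
option_a = "Open"
option_b = "Closed"

# Approximate MMVP-style prompt: options inline, so the model tends to
# answer with a full sentence that then has to be parsed.
mmvp_prompt = f"{question} (a) {option_a} (b) {option_b}"

# LLaVA-1.5-style multiple-choice prompt: the model is told to reply with
# just the option letter, so answers are short and easy to score.
llava_prompt = (
    f"{question}\n"
    f"A. {option_a}\n"
    f"B. {option_b}\n"
    "Answer with the option's letter from the given choices directly."
)

print(mmvp_prompt)
print("---")
print(llava_prompt)
```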

Richar-Du commented 8 months ago

Sorry for the misunderstanding: an answer is treated as correct only if both questions in the pair are answered correctly. In that case, the accuracy is 22.67%, which is close to the result in Table 3.
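For anyone reproducing this number, here is a minimal sketch of that pair-level scoring rule. The function name and the assumption that consecutive questions form a pair are mine, not the repo's code:

```python
def mmvp_pair_accuracy(correct):
    """Pair-level accuracy: a pair counts only if BOTH of its questions
    are answered correctly. `correct` is a list of booleans, ordered so
    that questions 2i and 2i+1 belong to pair i."""
    assert len(correct) % 2 == 0, "expected an even number of questions"
    pairs = [correct[i] and correct[i + 1] for i in range(0, len(correct), 2)]
    return 100.0 * sum(pairs) / len(pairs)

# Toy example: 3 pairs; only the first pair has both questions right.
print(mmvp_pair_accuracy([True, True, True, False, False, True]))  # 33.33...
```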

minhlong94 commented 8 months ago

Hi @Richar-Du, can you share the code you ran? I somehow get very weird responses from LLaVA-1.5, but I don't know which part I got wrong. I used the exact code provided, but the model repeated itself instead of answering.

Z-MU-Z commented 8 months ago

@minhlong94 I modified https://github.com/tsb0601/MMVP/blob/763500597e65c3446f09047837ceda76f4e264bf/LLaVA/llava/model/language_model/llava_llama.py#L76 to `input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)`, and the results seemed reasonable. However, I also found that LLaVA tends to generate longer sentences, which is different from MMVP-MoF. After I added `cur_prompt = cur_prompt.replace("(a)", "\n(a)").replace("(b)", "\n(b)")` after https://github.com/tsb0601/MMVP/blob/763500597e65c3446f09047837ceda76f4e264bf/scripts/evaluate_mllm.py#L59, following @Richar-Du, the model's responses have indeed become shorter.
Does this imply that the training data or settings of MoF are different from those of LLaVA? @tsb0601
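To make the second change concrete, here is a rough sketch of where that replace() line would sit in the evaluation loop. Everything except the `cur_prompt` handling is a placeholder (the toy `benchmark` list and the printed output are stand-ins for the real logic in scripts/evaluate_mllm.py):

```python
# Toy stand-in for the benchmark loader, just so this snippet runs on its own.
benchmark = [("Is the butterfly's wing open or closed?", "Open", "Closed")]

for question, option_a, option_b in benchmark:
    # Approximation of the prompt built around line 59 of evaluate_mllm.py.
    cur_prompt = f"{question} (a) {option_a} (b) {option_b}"
    # Put each option on its own line so the model answers more tersely.
    cur_prompt = cur_prompt.replace("(a)", "\n(a)").replace("(b)", "\n(b)")
    print(cur_prompt)
    # ... build the conversation from cur_prompt and call model.generate()
    #     as the original script does.
```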