Sorry for the misunderstanding, the answer is treated as correct only if both questions in the pair are answered correctly. In that case, the accuracy is 22.67%, which is close to the result in Table 3.
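For reference, here is a minimal sketch of that pair-level scoring, assuming consecutive questions form a pair (the function and variable names are hypothetical, not from the repo's code):

```python
# Hypothetical sketch of pair-level scoring: a pair counts as correct
# only if both of its questions were answered correctly.
def pair_accuracy(is_correct):
    # is_correct: per-question booleans; questions 2i and 2i+1 form pair i
    assert len(is_correct) % 2 == 0
    pairs = [is_correct[i] and is_correct[i + 1] for i in range(0, len(is_correct), 2)]
    return 100.0 * sum(pairs) / len(pairs)

# Example: 4 of 6 individual answers are correct, but only 1 of 3 pairs is fully correct.
print(pair_accuracy([True, True, True, False, False, True]))  # ~33.3
```

This is why the pair-level accuracy is so much lower than the per-question accuracy.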
Hi @Richar-Du, can you share the code you ran? I somehow get very weird responses from LLaVA-1.5, but I don't know which part I got wrong. I used the exact code provided, but the model repeated itself instead of answering.
@minhlong94 I modified https://github.com/tsb0601/MMVP/blob/763500597e65c3446f09047837ceda76f4e264bf/LLaVA/llava/model/language_model/llava_llama.py#L76 to

```python
input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)
```

and the results seemed reasonable. However, I also found that LLaVA tends to generate longer sentences, which is different from MMVP-MoF. After adding

```python
cur_prompt = cur_prompt.replace("(a)", "\n(a)").replace("(b)", "\n(b)")
```

after https://github.com/tsb0601/MMVP/blob/763500597e65c3446f09047837ceda76f4e264bf/scripts/evaluate_mllm.py#L59, following @Richar-Du, the model's responses have indeed become shorter.
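In case it helps anyone reproducing this, here is a rough sketch of where that replace could sit relative to the prompt construction in the evaluation script; the surrounding loop and field names below are assumptions for illustration, not the repo's actual code:

```python
# Hypothetical sketch: insert the line breaks right after the prompt string is built,
# so options (a) and (b) each start on their own line and the model answers more tersely.
for item in benchmark_items:  # assumed iterable of MMVP questions
    cur_prompt = f'{item["question"]} {item["options"]}'  # assumed fields
    cur_prompt = cur_prompt.replace("(a)", "\n(a)").replace("(b)", "\n(b)")
    # ...then pass cur_prompt to the model exactly as before...
```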
Does this imply that the training data or settings of MoF are different from those of LLaVA? @tsb0601
I evaluated LLaVA-1.5-7b on the MMVP dataset and found that its accuracy is 60.0%, which is significantly higher than the 24.7% reported in Table 3. Upon comparing the evaluation code, I discovered that the prompt used in https://github.com/tsb0601/MMVP/blob/763500597e65c3446f09047837ceda76f4e264bf/scripts/evaluate_mllm.py#L59 differs from the one used by LLaVA-1.5, which is: `{question}\nA. {}\nB. {}\nAnswer with the option's letter from the given choices directly.` Given the first prompt, the model generates a long sentence as the answer, whereas with the second prompt, the model provides the option directly. This difference in prompts leads to the large discrepancy in accuracy.

Anyway, the question is a binary choice, where a random guess would result in 50% accuracy; therefore, an accuracy of 60% also implies a significant problem with MLLMs :)
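To make the difference concrete, here is a hedged sketch of how the two prompt styles might be assembled for a single MMVP question. The example question and options are made-up placeholders, and the exact wording of the first (MMVP-style) prompt is only an approximation based on the (a)/(b) tags mentioned above; only the second string is the LLaVA-1.5 format quoted in this comment:

```python
# Hypothetical example contrasting the two prompt styles discussed above.
question = "Is the object facing left or right?"  # made-up placeholder question
options = ["Left", "Right"]                       # made-up placeholder options

# Approximation of the evaluate_mllm.py style: options tagged (a)/(b),
# which tends to elicit a free-form sentence from LLaVA-1.5.
mmvp_style_prompt = f"{question}\n(a) {options[0]}\n(b) {options[1]}"

# LLaVA-1.5's own multiple-choice format (as quoted above),
# which makes the model answer with a single option letter.
llava_style_prompt = (
    f"{question}\n"
    f"A. {options[0]}\n"
    f"B. {options[1]}\n"
    "Answer with the option's letter from the given choices directly."
)
```

With the second style, scoring reduces to comparing the returned letter against the ground-truth option, and the pair-level accuracy then follows from the pairing rule described earlier in the thread.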