Hi, congrats on the great work!

I noticed that in the released V* benchmark, the correct answer for each question is always the first option. Did the authors shuffle the options when evaluating on the benchmark (Table 1 in the paper)? I found empirically that for models such as LLaVA1.5, accuracy is much lower when the options are shuffled than when they are not.

Thanks!
For open-source end-to-end models like LLaVA1.5, we use likelihood-based evaluation, so the option order doesn't matter. For other systems or methods like GPT-4, we do shuffle the options.
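For reference, here is a minimal sketch of why likelihood-based evaluation is order-invariant, using GPT-2 from Hugging Face `transformers` as a hypothetical text-only stand-in for a multimodal model (the actual LLaVA1.5 evaluation would additionally condition on image features; the question and options below are also illustrative). Each option is scored by the summed log-probability of its tokens given the question, and the argmax over those scores does not depend on how the options are listed:

```python
# Sketch of order-invariant likelihood-based multiple-choice evaluation.
# Assumptions: GPT-2 as a stand-in scorer; a made-up question/options pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_log_likelihood(question: str, option: str) -> float:
    """Sum of log-probs of the option's tokens, conditioned on the question."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The log-prob of each target token is predicted at the previous position.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Count only the option tokens, not the question prompt.
    n_prompt = q_ids.shape[1]
    return token_ll[0, n_prompt - 1:].sum().item()

question = "Q: What color is the umbrella in the image?\nA:"
options = ["red", "blue", "green", "yellow"]
scores = [option_log_likelihood(question, o) for o in options]
print(options[scores.index(max(scores))])  # same answer under any shuffle
```

Because each option is scored independently and the prediction is the argmax of those scores, shuffling `options` permutes the score list but never changes which option wins. By contrast, a generation-based evaluation (e.g., prompting GPT-4 to pick a letter) sees the options as part of the prompt, so shuffling is needed there to control for position bias.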