Hi, congrats on the great work!

I noticed that in the released V* benchmark, the correct answer for each question is always the first option. Did the authors shuffle the options when evaluating on the benchmark (Table 1 in the paper)? I found empirically that for models such as LLaVA1.5, accuracy is much lower when the options are shuffled than when they are not.

Thanks!
For open-source end-to-end models like LLaVA1.5, we use likelihood-based evaluation, so the option order doesn't matter. For other systems or methods like GPT-4, we do shuffle the options.
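For reference, here is a minimal sketch of why likelihood-based evaluation is order-invariant, using GPT-2 from Hugging Face `transformers` as a hypothetical text-only stand-in for a multimodal model (the actual LLaVA1.5 evaluation would additionally condition on image features; the question and options below are also illustrative). Each option is scored by the summed log-probability of its tokens given the question, and the argmax over those scores does not depend on how the options are listed:

```python
# Sketch of order-invariant likelihood-based multiple-choice evaluation.
# Assumptions: GPT-2 as a stand-in scorer; a made-up question/options pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_log_likelihood(question: str, option: str) -> float:
    """Sum of log-probs of the option's tokens, conditioned on the question."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The log-prob of each target token is predicted at the previous position.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Count only the option tokens, not the question prompt.
    n_prompt = q_ids.shape[1]
    return token_ll[0, n_prompt - 1:].sum().item()

question = "Q: What color is the umbrella in the image?\nA:"
options = ["red", "blue", "green", "yellow"]
scores = [option_log_likelihood(question, o) for o in options]
print(options[scores.index(max(scores))])  # same answer under any shuffle
```

Because each option is scored independently and the prediction is the argmax of those scores, shuffling `options` permutes the score list but never changes which option wins. By contrast, a generation-based evaluation (e.g., prompting GPT-4 to pick a letter) sees the options as part of the prompt, so shuffling is needed there to control for position bias.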