penghao-wu / vstar

PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
https://vstar-seal.github.io/
MIT License

V*Bench #12

Closed: RetroVortex closed this issue 2 months ago

RetroVortex commented 6 months ago

Hi,

First of all, congratulations on this great work.

I have a question related to the benchmark. In the .json files, there is a 'bbox' parameter that holds the coordinates of the object to be located, in the format [x, y, width, height] (I assume; I sketch how I am reading this field at the end of this comment). Two questions about it:

  1. How did you determine the coordinates of the object? That is, did a human "draw" the bounding box and then write the coordinates into the .json file? Or did you ask the LLM to locate the object, print the coordinates of the bounding box, and then write them into the .json?
  2. After reviewing the code, I don't get the point of this 'bbox' parameter in the .json files. You don't use it anywhere in the 'vstar_bench_eval' script, right? I only see the 'question' and 'options' parameters of the 'annotation' variable being used in the main function. Am I missing something?

Thanks in advance.
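
For reference, here is a minimal sketch of how I am currently reading the annotation files. The field names 'bbox', 'question', and 'options' come from the benchmark .json files; the file path, the list-of-items structure, and the [x, y, width, height] pixel interpretation are my assumptions:

```python
import json

# Minimal sketch of how I am reading one V*Bench annotation file.
# Assumptions: the file is a JSON list of items, and 'bbox' is
# [x, y, width, height] in pixel coordinates; the path is hypothetical.
with open("vstar_bench/annotations.json") as f:
    annotations = json.load(f)

for item in annotations:
    question = item["question"]
    options = item["options"]
    x, y, w, h = item["bbox"]                # assumed [x, y, width, height]
    x1, y1, x2, y2 = x, y, x + w, y + h      # corner format, if ever needed
    print(question, options, (x1, y1, x2, y2))
```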

penghao-wu commented 6 months ago
  1. Yes, we manually annotate the bounding boxes.
  2. No, we do not use the human-labeled bounding box in the evaluation. It is provided only as a reference or ground truth; during the evaluation, the models should find the target objects by themselves.
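
(Not part of the official evaluation script; just a minimal sketch of how the reference box could be used for your own analysis, assuming 'bbox' is [x, y, width, height] and you have a model-predicted box in the same format. The example values are hypothetical.)

```python
def iou(box_a, box_b):
    """IoU between two boxes given as [x, y, width, height]."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Compare a model's predicted box against the annotated reference box.
reference_bbox = [120, 80, 40, 30]   # hypothetical values from the .json
predicted_bbox = [125, 85, 38, 28]   # hypothetical model output
print(f"IoU vs. reference: {iou(predicted_bbox, reference_bbox):.3f}")
```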