Hi,

First of all, congratulations on this great work.
I have a question related to the benchmark. In the .json files, there is a 'bbox' parameter holding the coordinates of the objects to be located, in the format [x, y, width, height] (I assume). Two questions about it:
To determine the coordinates of the object, did you do it manually? That is, did a human "draw" the bounding box and then write the coordinates into the .json file? Or did you ask the LLM to locate the object, print the bounding-box coordinates, and then write those into the .json?
After reviewing the code, I don't see the point of this 'bbox' parameter in the .json files. You don't use it anywhere in the 'vstar_bench_eval' script, right? I only see the 'question' and 'options' fields of the 'annotation' variable being used throughout the main function. Am I missing something?

Thanks in advance.
No, we do not use the human-labeled bounding boxes in the evaluation. They are provided only as a reference (ground truth); during evaluation, the models must find the target objects by themselves.
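Even though the eval script ignores 'bbox', the reference boxes could still be useful for post-hoc grounding analysis. Here is a minimal sketch of comparing a model's predicted box against the reference via intersection-over-union, assuming the [x, y, width, height] format described above (the helper functions and the 'predicted' box are illustrative, not part of the repo):

```python
import json

def to_corners(bbox):
    """Convert an [x, y, width, height] box to (x1, y1, x2, y2) corners."""
    x, y, w, h = bbox
    return x, y, x + w, y + h

def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, width, height] boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Overlap rectangle; clamp to zero when the boxes are disjoint.
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union else 0.0

# Hypothetical annotation entry using the benchmark's assumed bbox format.
annotation = json.loads('{"bbox": [100, 50, 40, 40]}')
predicted = [110, 60, 40, 40]  # a model's predicted box, same format
print(round(iou(annotation["bbox"], predicted), 3))  # → 0.391
```

This is just one way to measure whether a model's self-discovered grounding matches the human label; the evaluation itself, as noted above, only checks the multiple-choice answer.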