Closed (kexul closed 8 months ago)
I'm not sure whether I'm expecting too much. Here are GPT4V's answers from three runs:
In the image, there are three glasses of wine visible on the table. Two of them are near the center of the table, and one is located near the bottom left corner.
There are two glasses of red wine visible on the table.
In the image, there are three glasses of wine visible on the table. Two glasses are located near the center of the table, and one glass is near the bottom left corner of the image.
Yes, this is expected. As mentioned in A.4 of our paper, our search process currently focuses on finding a single target object rather than exhaustively locating all targets. Supporting counting would require two changes: modifying the search algorithm to find every instance in the image, and adding counting-related training data so that the VQA LLM can understand how many instances the visual search model found. Also note that our main focus is to show the importance of the visual search mechanism for multimodal systems in certain cases; in general, our model is still expected to be weaker than GPT4V.
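To make the difference between single-target and exhaustive search concrete, here is a minimal toy sketch. It is not the repository's search code: the quadrant-splitting strategy, the `toy_detect` stand-in detector, the `GLASSES` coordinates, and every function name are assumptions made up for illustration. `search_first` stops at the first hit, while `search_all` visits every patch and de-duplicates overlapping hits so they can be counted.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image coordinates

# Hypothetical ground-truth objects in a 128x128 toy "image".
GLASSES: List[Box] = [(10, 10, 20, 20), (70, 12, 80, 22), (5, 80, 15, 90)]


def toy_detect(region: Box) -> List[Box]:
    """Stand-in detector: an object is reported only if it lies fully inside
    the patch and is large enough relative to it (small objects need zoom)."""
    rx0, ry0, rx1, ry1 = region
    hits = []
    for x0, y0, x1, y1 in GLASSES:
        inside = rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1
        visible = (x1 - x0) * 4 >= (rx1 - rx0)
        if inside and visible:
            hits.append((x0, y0, x1, y1))
    return hits


def split(region: Box) -> List[Box]:
    """Split a patch into four quadrants."""
    x0, y0, x1, y1 = region
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def search_first(region: Box, detect: Callable[[Box], List[Box]],
                 min_size: int) -> Optional[Box]:
    """Single-target search: return as soon as any instance is found."""
    hits = detect(region)
    if hits:
        return hits[0]
    if min(region[2] - region[0], region[3] - region[1]) <= min_size:
        return None
    for sub in split(region):
        found = search_first(sub, detect, min_size)
        if found is not None:
            return found
    return None


def search_all(region: Box, detect: Callable[[Box], List[Box]],
               min_size: int, iou_thresh: float = 0.5) -> List[Box]:
    """Exhaustive search: visit every patch and keep each distinct instance."""
    found: List[Box] = []
    stack = [region]
    while stack:
        cur = stack.pop()
        for box in detect(cur):
            # Skip hits that overlap an instance already kept (same object
            # seen again from another patch).
            if all(iou(box, kept) < iou_thresh for kept in found):
                found.append(box)
        if min(cur[2] - cur[0], cur[3] - cur[1]) > min_size:
            stack.extend(split(cur))
    return found


image = (0, 0, 128, 128)
first = search_first(image, toy_detect, min_size=16)
everything = search_all(image, toy_detect, min_size=16)
print("single-target search found:", first)
print("exhaustive search counted:", len(everything))
```

In this toy setup the single-target search returns one box and stops, so nothing downstream can know how many glasses there are. The exhaustive variant produces a list whose length is the count, but it visits far more patches, and the answering model would still need to be trained to use that count.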
Thanks!
Hi, thanks for the great work! I've played with it for a while and it's very impressive! But the model failed most of my counting questions. Is there room for improvement?
For example, when I ask:
how many glasses of wine here?