penghao-wu / vstar

PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
https://vstar-seal.github.io/
MIT License
497 stars, 32 forks

Improvement for counting? #4

Closed kexul closed 8 months ago

kexul commented 8 months ago

Hi, thanks for the great work! I've played with it for a while and it's very, very impressive! But the model failed most of my counting questions. Is there some room for improvement?

For example, when I ask: "How many glasses of wine are here?" (two images attached)

kexul commented 8 months ago

I'm not sure whether I'm expecting too much. Here are GPT4V's answers from three runs:

In the image, there are three glasses of wine visible on the table. Two of them are near the center of the table, and one is located near the bottom left corner.

There are two glasses of red wine visible on the table.

In the image, there are three glasses of wine visible on the table. Two glasses are located near the center of the table, and one glass is near the bottom left corner of the image.

penghao-wu commented 8 months ago

Yes, this is expected. As mentioned in A.4 of our paper, our search process currently focuses on finding a single target object rather than locating all targets exhaustively. To support counting, the search algorithm would need to be modified to find all instances in the image exhaustively, and the VQA LLM would need additional counting-related training data so that it can understand the number of instances found by the visual search model. Also, note that our main focus is to show the importance of the visual search mechanism for multimodal systems in certain cases, and our model is still expected to be weaker than GPT4V in general.
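To make the "exhaustive" variant concrete, here is a minimal sketch of what visiting every patch and merging duplicate detections could look like. It is not code from this repository: `find_all`, `toy_detect`, `GLASSES`, `min_size`, and the "visible once the patch is at most 8x the object width" rule are all hypothetical stand-ins for the real visual search model, and a plain quadtree split replaces the guided search order. It covers only the search side, not the counting-related training data for the VQA LLM.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def find_all(
    region: Box,
    detect: Callable[[Box], List[Box]],
    min_size: float = 256.0,
    iou_thresh: float = 0.5,
) -> List[Box]:
    """Visit every patch (no early stop) and keep one box per instance."""
    found: List[Box] = []
    stack = [region]
    while stack:
        x1, y1, x2, y2 = stack.pop()
        for box in detect((x1, y1, x2, y2)):
            # Skip a detection that overlaps one we already kept.
            if all(iou(box, kept) < iou_thresh for kept in found):
                found.append(box)
        # Keep splitting into four sub-patches until the patch is small enough.
        if (x2 - x1) > min_size and (y2 - y1) > min_size:
            mx, my = (x1 + x2) / 2, (y1 + y2) / 2
            stack.extend(
                [(x1, y1, mx, my), (mx, y1, x2, my), (x1, my, mx, y2), (mx, my, x2, y2)]
            )
    return found


# Hypothetical ground truth: three small objects in a 1024x1024 image.
GLASSES: List[Box] = [
    (100.0, 100.0, 160.0, 200.0),
    (600.0, 300.0, 660.0, 400.0),
    (40.0, 900.0, 100.0, 1000.0),
]


def toy_detect(region: Box) -> List[Box]:
    """Toy detector (assumption): an object is seen only if its center lies in
    the patch and the patch is at most 8x wider than the object."""
    x1, y1, x2, y2 = region
    hits = []
    for g in GLASSES:
        cx, cy = (g[0] + g[2]) / 2, (g[1] + g[3]) / 2
        inside = x1 <= cx < x2 and y1 <= cy < y2
        big_enough = (g[2] - g[0]) >= (x2 - x1) / 8
        if inside and big_enough:
            hits.append(g)
    return hits


boxes = find_all((0.0, 0.0, 1024.0, 1024.0), toy_detect)
print(len(boxes))  # prints 3
```

In this toy setup none of the objects is visible at full resolution, so a search that stops at the first hit would report one instance at most, while the exhaustive pass reports all three. The IoU check matters once `min_size` is lowered, because the same object is then detected again in smaller patches.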

kexul commented 8 months ago

Thanks!