microsoft / SoM

Set-of-Mark Prompting for GPT-4V and LMMs
MIT License

I would like to ask how to do a visual grounding (REC) task directly using GPT-4V? #41

Open xiang-xiang-zhu opened 4 months ago

xiang-xiang-zhu commented 4 months ago

Thank you for your work! I would like to give GPT-4V an image directly, along with a prompt like "This is an image; I need to do a visual grounding task in which you generate the coordinates [x, y, h, w] of a bounding box based on a query." But I found that the output is poor; the model even produces coordinates at random. Do I need to preprocess the image first? How should I go about this? Thank you!

abrichr commented 4 months ago

Current frontier multimodal models (e.g. GPT-4V) do not appear to be good at localizing or segmenting objects in images when prompted directly.

At https://github.com/OpenAdaptAI/OpenAdapt we use Ultralytics FastSAM to run segmentation first with good results. See e.g. https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (scroll down to see images).

edit: https://x.com/openadaptai/status/1798502003045548480
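To connect this back to the original REC question: once a segmenter such as FastSAM has produced binary masks, each mask can be reduced to the [x, y, w, h] box the grounding task expects, instead of asking GPT-4V to emit raw coordinates. A minimal NumPy sketch (the function name is illustrative, not from the SoM or OpenAdapt codebases):

```python
import numpy as np

def mask_to_xywh(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Convert a binary segmentation mask (H x W) to an [x, y, w, h] box."""
    ys, xs = np.nonzero(mask)          # row/column indices of mask pixels
    x, y = xs.min(), ys.min()          # top-left corner
    w = xs.max() - x + 1               # box width in pixels
    h = ys.max() - y + 1               # box height in pixels
    return int(x), int(y), int(w), int(h)

# Toy example: a 3x4 block of ones inside a 10x10 grid.
mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:7] = True
print(mask_to_xywh(mask))  # → (3, 2, 4, 3)
```

You can then overlay numbered marks on these regions (the Set-of-Mark idea) and ask GPT-4V to answer the query by mark index, which it handles far more reliably than free-form coordinates.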