xiang-xiang-zhu opened 4 months ago
Current frontier multimodal models (e.g. GPT-4) do not appear to be good at segmenting images.
At https://github.com/OpenAdaptAI/OpenAdapt we run segmentation first with Ultralytics FastSAM, with good results. See e.g. https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (scroll down to see images).
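For reference, a minimal sketch of this segment-first approach using the Ultralytics FastSAM API. The weights filename, image path, and thresholds below are illustrative defaults from the Ultralytics docs, not OpenAdapt's exact settings:

```python
def boxes_to_xywh(xyxy_boxes):
    """Convert [x1, y1, x2, y2] corner boxes (the Ultralytics output
    format) into [x, y, w, h] top-left/size format."""
    return [[x1, y1, x2 - x1, y2 - y1] for x1, y1, x2, y2 in xyxy_boxes]


def segment_image(image_path):
    """Run FastSAM on an image and return bounding boxes as [x, y, w, h].

    The import is local so boxes_to_xywh() above works even without
    ultralytics installed; model name and thresholds are illustrative.
    """
    from ultralytics import FastSAM  # pip install ultralytics

    model = FastSAM("FastSAM-s.pt")  # weights are downloaded on first use
    results = model(
        image_path, retina_masks=True, imgsz=1024, conf=0.4, iou=0.9
    )
    return boxes_to_xywh(results[0].boxes.xyxy.tolist())
```

The resulting boxes (or crops of them) can then be passed to the multimodal model, so it never has to localize anything from raw pixels itself.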
Thank you for your work! I would now like to give GPT-4V the image directly, with a prompt like "This is an image; I need to do a visual grounding task where you generate the coordinates [x,y,h,w] of a bounding box based on a query." But the output is poor — the model even seems to produce coordinates at random. Do I need to preprocess the image first? How should I go about this? Thank you!
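One workaround, in line with the segment-first approach above, is to avoid asking the model to regress raw pixel coordinates at all: pre-compute candidate boxes with a segmenter and ask the model to pick the one that matches the query. A minimal sketch of such a prompt builder — the function name and prompt wording are illustrative, not an established API:

```python
def build_grounding_prompt(query, boxes):
    """Build a prompt that asks a multimodal model to *choose* among
    pre-computed candidate boxes (e.g. from FastSAM) rather than
    generate raw coordinates, which current models do unreliably.

    `boxes` is a list of [x, y, w, h] in pixel coordinates.
    """
    lines = [
        f"Candidate {i}: [x={x}, y={y}, w={w}, h={h}]"
        for i, (x, y, w, h) in enumerate(boxes)
    ]
    return (
        "The attached image contains the following candidate regions:\n"
        + "\n".join(lines)
        + f"\n\nQuery: {query}\n"
        "Answer with only the number of the candidate that best "
        "matches the query."
    )


if __name__ == "__main__":
    print(
        build_grounding_prompt(
            "the Submit button",
            [[10, 20, 120, 40], [300, 500, 80, 30]],
        )
    )
```

The model's answer (a candidate index) can then be mapped back to the corresponding box, turning an open-ended coordinate-generation task into a much easier multiple-choice one.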