luca-medeiros / lang-segment-anything

SAM with text prompt
Apache License 2.0

Adding OWLViT/OWLV2 as options for the visual grounding part #55

Open · skulshreshtha opened this issue 4 months ago

skulshreshtha commented 4 months ago

šŸš€ Feature

Currently, the project uses GroundingDINO as the visual grounding model, which performs strongly on current zero-shot object detection benchmarks. We could give users the flexibility to choose between different visual grounding models, such as OWL-ViT, OWLv2, and OFA.
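
If helpful, here is a minimal sketch of how the selection could be wired internally. The builder functions and registry below are hypothetical, not existing lang-segment-anything code; they are stubs only to keep the sketch self-contained:

from typing import Any, Callable, Dict

# Placeholder builders: in a real PR each would load the corresponding
# grounding model. They are stubs here so the sketch runs as written.
def build_groundingdino() -> Any: ...
def build_owlvit() -> Any: ...
def build_owlv2() -> Any: ...
def build_ofa() -> Any: ...

# Hypothetical registry mapping user-facing names to grounding backends.
GROUNDING_BACKENDS: Dict[str, Callable[[], Any]] = {
    "groundingdino": build_groundingdino,
    "owlvit": build_owlvit,
    "owlv2": build_owlv2,
    "ofa": build_ofa,
}

def build_grounding_model(name: str = "groundingdino") -> Any:
    # Fail fast on an unrecognized model name.
    if name not in GROUNDING_BACKENDS:
        raise ValueError(f"Unknown grounding model: {name!r}")
    return GROUNDING_BACKENDS[name]()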

Motivation & Examples

Since this project is about text-guided segmentation, adding the ability to choose the model used in the visual grounding stage of the pipeline seems like a natural addition.

The proposed usage would look like this:

from PIL import Image
from lang_sam import LangSAM

# Initialize and select the visual grounding model if desired.
# Default is 'groundingdino'; other options are 'ofa', 'owlvit', and 'owlv2'.
model = LangSAM(model='groundingdino')
image_pil = Image.open("./assets/car.jpeg").convert("RGB")
text_prompt = "wheel"
masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)
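
For the OWL-ViT/OWLv2 options, a backend could be built on the Hugging Face transformers implementations. A rough sketch of the detection step, assuming the google/owlv2-base-patch16-ensemble checkpoint and an illustrative score threshold (neither is part of this repo):

import torch
from PIL import Image
from transformers import Owlv2ForObjectDetection, Owlv2Processor

checkpoint = "google/owlv2-base-patch16-ensemble"  # assumed checkpoint name
processor = Owlv2Processor.from_pretrained(checkpoint)
model = Owlv2ForObjectDetection.from_pretrained(checkpoint)

image = Image.open("./assets/car.jpeg").convert("RGB")
texts = [["wheel"]]  # one list of text queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits to boxes in pixel coordinates; 0.3 is an illustrative threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)
boxes, scores = results[0]["boxes"], results[0]["scores"]
# These boxes could then be passed to SAM as box prompts, just as the
# GroundingDINO boxes are today.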

Note

We only consider adding new features if they are relevant to this library. Consider if this new feature deserves to be here or should be a new library.

luca-medeiros commented 4 months ago

@skulshreshtha Interesting! Do you want to try an implementation for it?

skulshreshtha commented 4 months ago

@luca-medeiros Yes, sure. If you think this makes sense, I can try and raise a PR for this.