🚀 Feature

Currently, the project uses GroundingDINO as the visual grounding model, which is the best-performing model on some benchmark datasets.
We could give users the flexibility to choose between different visual grounding models, such as OFA, OWL-ViT, and OWLv2.
Motivation & Examples

Tell us why the feature is useful.
Since this project is about text-guided segmentation, adding the ability to choose the visual grounding technique in the pipeline seems like a natural addition.
Describe what the feature would look like, if it is implemented.
Best demonstrated using code examples in addition to words.
from PIL import Image
from lang_sam import LangSAM
# Initialize and select the visual grounding model if desired.
# Default is 'groundingdino'; other options are 'ofa', 'owlvit', and 'owlv2'.
model = LangSAM(model='groundingdino')
image_pil = Image.open("./assets/car.jpeg").convert("RGB")
text_prompt = "wheel"
masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)
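
One way this could work under the hood is a small registry that maps the name passed to the LangSAM constructor to a backend wrapper exposing a common box-prediction interface, with SAM then run on the predicted boxes as it is today. The sketch below is only an illustration under that assumption, not lang_sam's actual internals: GroundingModel, build_grounding_model, and OwlViTGrounder are hypothetical names, and only the Hugging Face OWL-ViT calls reflect a real, existing API.

# Hypothetical sketch of backend selection (not lang_sam's real internals).
from abc import ABC, abstractmethod

import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection


class GroundingModel(ABC):
    """Common interface each visual grounding backend would implement (hypothetical)."""

    @abstractmethod
    def predict_boxes(self, image_pil: Image.Image, text_prompt: str, box_threshold: float = 0.3):
        """Return (boxes, scores, phrases) for the given image and text prompt."""


class OwlViTGrounder(GroundingModel):
    """Illustrative wrapper around Hugging Face OWL-ViT for open-vocabulary box prediction."""

    def __init__(self, checkpoint: str = "google/owlvit-base-patch32"):
        self.processor = OwlViTProcessor.from_pretrained(checkpoint)
        self.model = OwlViTForObjectDetection.from_pretrained(checkpoint)

    def predict_boxes(self, image_pil, text_prompt, box_threshold=0.3):
        inputs = self.processor(text=[[text_prompt]], images=image_pil, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        target_sizes = torch.tensor([image_pil.size[::-1]])  # (height, width)
        results = self.processor.post_process_object_detection(
            outputs=outputs, threshold=box_threshold, target_sizes=target_sizes
        )[0]
        phrases = [text_prompt] * len(results["boxes"])
        return results["boxes"], results["scores"], phrases


# Registry mapping the constructor argument to a backend builder (names are illustrative).
GROUNDING_MODELS = {
    "owlvit": OwlViTGrounder,
    # "groundingdino": GroundingDinoGrounder,  # the existing default would be wrapped the same way
    # "owlv2": OwlV2Grounder,
    # "ofa": OfaGrounder,
}


def build_grounding_model(name: str) -> GroundingModel:
    if name not in GROUNDING_MODELS:
        raise ValueError(f"Unknown grounding model '{name}'. Available: {sorted(GROUNDING_MODELS)}")
    return GROUNDING_MODELS[name]()

With such a registry, LangSAM(model='owlvit') could simply call build_grounding_model('owlvit') in its constructor and delegate box prediction to the selected backend before segmenting the returned boxes with SAM, so predict() keeps its current signature regardless of the chosen grounding model.
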
Note
We only consider adding new features if they are relevant to this library.
Consider if this new feature deserves to be here or should be a new library.