[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Same here. Seems to be a toy model.
You can use Florence-2 or PaliGemma. They are much better and offer similar functionality.
Or go with Grounding DINO + SAM; they are older but can still handle some of this.
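For reference, the Grounding DINO → SAM pipeline mentioned above can be wired up with the Hugging Face `transformers` checkpoints (`IDEA-Research/grounding-dino-tiny`, `facebook/sam-vit-base`). This is only a sketch, not GLaMM's code; the function name and thresholds are my own choices, and the imports are kept inside the function so the snippet only needs `transformers` installed when it is actually called:

```python
def ground_and_segment(image, text_prompt, box_threshold=0.35, text_threshold=0.25):
    """Detect phrase-grounded boxes with Grounding DINO, then prompt SAM with them.

    image: a PIL.Image in RGB; text_prompt: lowercase phrases ending in '.', e.g. "a cat."
    Returns (boxes, masks). Sketch only -- model names/thresholds are assumptions.
    """
    # Imports kept local so merely defining this helper has no heavy dependencies.
    from transformers import (AutoProcessor, AutoModelForZeroShotObjectDetection,
                              SamModel, SamProcessor)

    # Step 1: open-vocabulary detection conditioned on the text prompt.
    det_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
    det_model = AutoModelForZeroShotObjectDetection.from_pretrained(
        "IDEA-Research/grounding-dino-tiny")
    det_inputs = det_processor(images=image, text=text_prompt, return_tensors="pt")
    det_outputs = det_model(**det_inputs)
    results = det_processor.post_process_grounded_object_detection(
        det_outputs, det_inputs.input_ids,
        box_threshold=box_threshold, text_threshold=text_threshold,
        target_sizes=[image.size[::-1]])  # (height, width)
    boxes = results[0]["boxes"]  # xyxy boxes in image coordinates

    # Step 2: feed the detected boxes to SAM as box prompts.
    sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
    sam_model = SamModel.from_pretrained("facebook/sam-vit-base")
    sam_inputs = sam_processor(image, input_boxes=[boxes.tolist()], return_tensors="pt")
    sam_outputs = sam_model(**sam_inputs)
    masks = sam_processor.image_processor.post_process_masks(
        sam_outputs.pred_masks, sam_inputs["original_sizes"],
        sam_inputs["reshaped_input_sizes"])
    return boxes, masks
```

Calling it looks like `boxes, masks = ground_and_segment(img, "a dog.")`; note Grounding DINO expects lowercase prompts terminated with a period.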
Hi,
I tried to use the modified app.py to work with my own data, but got strange segmentation results:
I don't know what went wrong and I hope to get your help. Here is the modified code:
def inference(input_str, all_inputs, follow_up, generate):
    bbox_img = all_inputs['boxes']
if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    tokenizer = setup_tokenizer_and_special_tokens(args)
    model = initialize_model(args, tokenizer)
    model = prepare_model_for_inference(model, args)
    global_enc_processor = CLIPImageProcessor.from_pretrained(model.config.vision_tower)
    transform = ResizeLongestSide(args.image_size)
    model.eval()
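When the segmentation output looks strange, it often helps to rule out visualization or mask-alignment bugs before blaming the model. Here is a small, self-contained helper (my own, not part of GLaMM's app.py) that blends a boolean mask onto an RGB image so you can inspect the raw prediction directly:

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend a binary mask onto an RGB uint8 image (H x W x 3) for inspection.

    image: np.uint8 array of shape (H, W, 3); mask: boolean array of shape (H, W).
    Returns a new uint8 array with masked pixels tinted by `color`.
    """
    assert image.shape[:2] == mask.shape, "mask must match the image spatially"
    out = image.astype(np.float32).copy()
    # Alpha-blend only the pixels covered by the mask.
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=np.float32)
    return out.astype(np.uint8)
```

If the overlaid mask is offset or scaled wrong relative to the image, the bug is likely in the resize/preprocessing path (e.g. `ResizeLongestSide` coordinates not being mapped back to the original resolution) rather than in the model itself.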