mbzuai-oryx / groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
https://grounding-anything.com

How can I let the model receive multiple images at once #60

Open bibibabibo26 opened 1 month ago

bibibabibo26 commented 1 month ago

Can your model be fed multiple images at once, such as different frames of a video? Or can it be modified so that the language model receives the tokens of multiple images at once?

mmaaz60 commented 1 month ago

Hi @bibibabibo26,

Thank you for your interest in our work. The current GLaMM model is designed to work with a single image only. However, it can be modified to accept multiple images. At the LLM part, it would be relatively simple, as we can treat the multiple images as video frames and concatenate their tokens. In the grounding part, we may have to introduce special tokens to decide whether a generated <seg> token refers to the first or the second image. Alternatively, we could design a segmentation encoder-decoder architecture that works with multiple images directly.
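To illustrate the LLM-side change, here is a minimal sketch of concatenating per-frame visual tokens into one sequence. All names and shapes here are assumptions for illustration (e.g. `concat_frame_tokens`, a 4096-dim feature size, an optional separator token), not part of GLaMM's actual API.

```python
import torch

# Hypothetical sketch: feed multiple images (video frames) to the LLM by
# concatenating their visual tokens along the sequence dimension.
# Shapes and the separator-token idea are assumptions, not GLaMM's code.

def concat_frame_tokens(frame_feats, frame_sep=None):
    """frame_feats: list of [num_patches, dim] tensors, one per image.
    frame_sep: optional [dim] tensor inserted between frames so the LLM
    can tell where one image's tokens end and the next begin.
    Returns a single [total_tokens, dim] sequence for the LLM."""
    pieces = []
    for i, feats in enumerate(frame_feats):
        if frame_sep is not None and i > 0:
            # one separator token between consecutive frames
            pieces.append(frame_sep.unsqueeze(0))
        pieces.append(feats)
    return torch.cat(pieces, dim=0)

# Example: 2 frames of 256 patch tokens each, 4096-dim features
dim = 4096
frames = [torch.randn(256, dim) for _ in range(2)]
sep = torch.zeros(dim)  # placeholder; in practice a learned embedding
seq = concat_frame_tokens(frames, frame_sep=sep)
print(seq.shape)  # torch.Size([513, 4096]) — 256 + 1 separator + 256
```

In practice the separator would be a learned embedding (or a reserved token id expanded through the embedding table), and positional encodings would need to account for the longer sequence.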
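For the grounding-side change, one way to realize the "special tokens" idea is to tag each segmentation token with the index of the image it grounds into, e.g. `<seg:0>` for the first frame and `<seg:1>` for the second, then route each predicted mask to the right frame at decode time. The token format below is purely an assumption for illustration; GLaMM itself uses a single `<seg>` token.

```python
import re

# Hypothetical sketch: per-image grounding tokens such as "<seg:0>" and
# "<seg:1>". The "<seg:i>" format is an assumption, not part of GLaMM.

SEG_PATTERN = re.compile(r"<seg:(\d+)>")

def route_seg_tokens(generated_text):
    """Return the image index for each <seg:i> token, in order of
    appearance, so each predicted mask can be decoded against the
    corresponding frame's image features."""
    return [int(m.group(1)) for m in SEG_PATTERN.finditer(generated_text)]

text = "The cat <seg:0> in the first frame moves left <seg:1> in the second."
print(route_seg_tokens(text))  # [0, 1]
```

These indexed tokens would have to be added to the tokenizer vocabulary and trained, and the segmentation decoder would then condition each mask prediction on the features of the frame its token points to.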

Please do share if you have made any progress towards this interesting research direction. Good Luck!