mbzuai-oryx / groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
https://grounding-anything.com

How can I let the model receive multiple images at once #60

Open bibibabibo26 opened 1 month ago

bibibabibo26 commented 1 month ago

Can your model be fed multiple images at once, such as different frames of a video? Or can it be modified so that the language model receives the tokens of multiple images at once?

mmaaz60 commented 1 month ago

Hi @bibibabibo26,

Thank you for your interest in our work. The current GLaMM model is designed to work with a single image only. However, it can be modified to accept multiple images. At the LLM part, it would be relatively simple, as we can treat the multiple images as video frames and concatenate their tokens. In the grounding part, we may have to introduce special tokens to decide whether a generated <seg> token refers to the first or the second image. Alternatively, we could design a segmentation encoder-decoder architecture that works with multiple images directly.
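To illustrate the LLM-side change, here is a minimal sketch of concatenating per-frame visual tokens into one sequence. All names and shapes here are assumptions for illustration (e.g. `concat_frame_tokens`, a 4096-dim feature size, an optional separator token), not part of GLaMM's actual API.

```python
import torch

# Hypothetical sketch: feed multiple images (video frames) to the LLM by
# concatenating their visual tokens along the sequence dimension.
# Shapes and the separator-token idea are assumptions, not GLaMM's code.

def concat_frame_tokens(frame_feats, frame_sep=None):
    """frame_feats: list of [num_patches, dim] tensors, one per image.
    frame_sep: optional [dim] tensor inserted between frames so the LLM
    can tell where one image's tokens end and the next begin.
    Returns a single [total_tokens, dim] sequence for the LLM."""
    pieces = []
    for i, feats in enumerate(frame_feats):
        if frame_sep is not None and i > 0:
            # one separator token between consecutive frames
            pieces.append(frame_sep.unsqueeze(0))
        pieces.append(feats)
    return torch.cat(pieces, dim=0)

# Example: 2 frames of 256 patch tokens each, 4096-dim features
dim = 4096
frames = [torch.randn(256, dim) for _ in range(2)]
sep = torch.zeros(dim)  # placeholder; in practice a learned embedding
seq = concat_frame_tokens(frames, frame_sep=sep)
print(seq.shape)  # torch.Size([513, 4096]) — 256 + 1 separator + 256
```

In practice the separator would be a learned embedding (or a reserved token id expanded through the embedding table), and positional encodings would need to account for the longer sequence.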
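For the grounding-side change, one way to realize the "special tokens" idea is to tag each segmentation token with the index of the image it grounds into, e.g. `<seg:0>` for the first frame and `<seg:1>` for the second, then route each predicted mask to the right frame at decode time. The token format below is purely an assumption for illustration; GLaMM itself uses a single `<seg>` token.

```python
import re

# Hypothetical sketch: per-image grounding tokens such as "<seg:0>" and
# "<seg:1>". The "<seg:i>" format is an assumption, not part of GLaMM.

SEG_PATTERN = re.compile(r"<seg:(\d+)>")

def route_seg_tokens(generated_text):
    """Return the image index for each <seg:i> token, in order of
    appearance, so each predicted mask can be decoded against the
    corresponding frame's image features."""
    return [int(m.group(1)) for m in SEG_PATTERN.finditer(generated_text)]

text = "The cat <seg:0> in the first frame moves left <seg:1> in the second."
print(route_seg_tokens(text))  # [0, 1]
```

These indexed tokens would have to be added to the tokenizer vocabulary and trained, and the segmentation decoder would then condition each mask prediction on the features of the frame its token points to.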

Please do share if you have made any progress towards this interesting research direction. Good Luck!