Open Stani-s opened 2 weeks ago
This is already feasible with models that support it. Could you point to where the authors mention PaliGemma being trained on multi-image domains?
At the time of writing, supported models include:
cc @zucchini-nlp
Hi,
In section 3.2.4 of the paper (https://arxiv.org/html/2407.07726v1#S3), the authors mention multi-image inputs as a possible target task to fine-tune towards.
There is also a checkpoint of the model, https://huggingface.co/google/paligemma-3b-ft-nlvr2-448, that was fine-tuned for such tasks.
Thank you for linking those models, I'll check them out.
Good point! Indeed, it seems that PaliGemma can take more than one image as input.
AFAIK most if not all VLMs can theoretically support multiple images in one prompt, even though they aren't tuned explicitly and can have poor generation quality. So, imo we can accept any number of images as input to a VLM by default, if there isn't any strict requirement in modeling that prohibits multiple images. WDYT @NielsRogge?
@Stani-s, for PaliGemma you can give it a try, or I'll make a PR sometime next week.
Unfortunately I won't be able to for the next 2 weeks.
Feature request
Adding the ability to pass multiple images per prompt to PaliGemma. Among other changes, this would mean changing the argument type of images on PaliGemmaProcessor to allow array[array[torch.Tensor]] for batch processing.
Motivation
The model was trained for multi-image / short-video tasks, so it should be able to take such inputs.
Your contribution
I could document this if this is supported and I missed it.
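To make the requested input shape concrete, here is a minimal sketch of how a flat or nested images argument could be normalized into one list of images per prompt. The helper nest_images and the Image alias are hypothetical names used only for illustration; this is not the PaliGemmaProcessor implementation, and it does no actual image processing.

```python
from typing import Any, List

# Stand-in for torch.Tensor or PIL.Image (assumption); kept generic so the
# sketch runs without any third-party dependency.
Image = Any


def nest_images(images: Any, num_prompts: int) -> List[List[Image]]:
    """Hypothetical helper: return one list of images per prompt.

    Accepts a single image, a flat list (one image per prompt), or a
    nested list (several images per prompt).
    """
    # A single image is treated as a flat list with one entry.
    if not isinstance(images, (list, tuple)):
        images = [images]

    is_nested = [isinstance(item, (list, tuple)) for item in images]
    if not any(is_nested):
        # Flat list: each prompt gets exactly one image.
        nested = [[item] for item in images]
    elif all(is_nested):
        # Already nested: each inner list belongs to one prompt.
        nested = [list(item) for item in images]
    else:
        raise ValueError("images must be either a flat list or a list of lists")

    if len(nested) != num_prompts:
        raise ValueError(
            f"got images for {len(nested)} prompts, expected {num_prompts}"
        )
    return nested
```

With a helper along these lines, existing single-image calls would keep working, while a batch such as [[img_a, img_b], [img_c]] would map two images to the first prompt and one to the second.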