huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add multi image prompts to multimodal LLMs that support it (PaliGemma) #33113

Open Stani-s opened 2 weeks ago

Stani-s commented 2 weeks ago

Feature request

Add the ability to pass multiple images per prompt to PaliGemma. Among other changes, this would require changing the type of the images argument of PaliGemmaProcessor to accept array[array[torch.Tensor]], i.e. one inner list of images per prompt, for batch processing.
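To illustrate the requested argument shape, here is a minimal sketch of how flat and nested image inputs could be normalized to one list of images per prompt. The helper name normalize_images is hypothetical and is not part of PaliGemmaProcessor; treating a flat list as one image per prompt is an assumption made for this example only.

```python
from typing import Any, List, Sequence, Union

# A single image, a flat list of images, or one inner list of images per prompt.
ImageInput = Union[Any, Sequence[Any], Sequence[Sequence[Any]]]


def normalize_images(images: ImageInput) -> List[List[Any]]:
    """Return images as one inner list per prompt (hypothetical helper)."""
    # A single image (anything that is not a list/tuple) becomes a batch of one prompt.
    if not isinstance(images, (list, tuple)):
        return [[images]]
    # Already nested: one inner list of images per prompt.
    if all(isinstance(item, (list, tuple)) for item in images):
        return [list(item) for item in images]
    # Mixed nesting is ambiguous, so reject it.
    if any(isinstance(item, (list, tuple)) for item in images):
        raise ValueError("images must be either flat or fully nested")
    # Flat list: assume one image per prompt (an assumption of this sketch).
    return [[item] for item in images]
```

With this shape, a batch such as [[img_a, img_b], [img_c]] would mean two images for the first prompt and one for the second.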

Motivation

The model was trained for multi-image and short-video tasks, so it should be able to take such inputs.

Your contribution

I could document this if this is supported and I missed it.

NielsRogge commented 2 weeks ago

This is already feasible with models that support it. Could you point to where the authors mention PaliGemma being trained on multi-image domains?

At the time of writing, supported models include:

cc @zucchini-nlp

Stani-s commented 2 weeks ago

Hi,

In section 3.2.4 of the paper (https://arxiv.org/html/2407.07726v1#S3), the authors mention multi-image inputs as a possible target task for fine-tuning.

There is also a checkpoint of the model https://huggingface.co/google/paligemma-3b-ft-nlvr2-448 that was finetuned for such tasks.

Thank you for linking those models, I'll check them out.

zucchini-nlp commented 2 weeks ago

Good point! Indeed, it seems that PaliGemma can take more than one image as input.

AFAIK most, if not all, VLMs can theoretically support multiple images in one prompt, even if they weren't explicitly tuned for it and may generate poor-quality output. So, imo, we can accept any number of images as input to a VLM by default, unless something in the modeling code strictly prohibits multiple images. WDYT @NielsRogge?
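A rough sketch of that permissive default, assuming images have already been grouped per prompt: any number of images is accepted unless a model declares a hard limit. The function check_image_counts and its max_images parameter are hypothetical names used only for illustration, not existing Transformers API.

```python
from typing import Any, List, Optional, Sequence


def check_image_counts(
    batch: Sequence[Sequence[Any]], max_images: Optional[int] = None
) -> List[int]:
    """Return the number of images per prompt, enforcing an optional limit."""
    counts = [len(images) for images in batch]
    # max_images=None (the default) means the model imposes no hard limit.
    if max_images is not None:
        for index, count in enumerate(counts):
            if count > max_images:
                raise ValueError(
                    f"prompt {index} has {count} images, "
                    f"but at most {max_images} are allowed"
                )
    return counts
```

Only a model whose architecture truly cannot handle several images would pass a limit; everything else would fall through to the default.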

For PaliGemma, @Stani-s, you can give it a try, or I'll make a PR sometime next week.

Stani-s commented 2 weeks ago

Unfortunately, I won't be able to for the next 2 weeks.