huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Correct check for SDPA in Vision Language Models #30565

Open zucchini-nlp opened 5 months ago

zucchini-nlp commented 5 months ago

System Info

In the current implementation of VLMs, the "_supports_sdpa" attribute checks for and activates SDPA attention only for the language model. For example, in Llava.

It should also check for SDPA attention in the vision tower and, if available, use it.

We can raise a warning for composite models if one part supports SDPA but the other does not, and activate SDPA only for the supported part. That way the user knows what is happening in the background.
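A minimal sketch of the proposed behavior, using hypothetical stand-in classes rather than the actual transformers internals (the class names, `attn_implementation` field, and `set_attn_implementation` method here are illustrative assumptions; only the `_supports_sdpa` attribute name comes from the issue):

```python
import warnings

# Hypothetical sub-models: each declares whether it supports SDPA,
# mirroring the "_supports_sdpa" attribute discussed in this issue.
class LanguageModel:
    _supports_sdpa = True
    attn_implementation = "eager"

class VisionTower:
    _supports_sdpa = False
    attn_implementation = "eager"

class CompositeVLM:
    """Illustrative composite model that checks SDPA support per component."""

    def __init__(self):
        self.language_model = LanguageModel()
        self.vision_tower = VisionTower()

    def set_attn_implementation(self, requested="sdpa"):
        parts = {
            "language_model": self.language_model,
            "vision_tower": self.vision_tower,
        }
        supported = {name: getattr(m, "_supports_sdpa", False) for name, m in parts.items()}
        # Warn when only some parts support SDPA, so the user knows
        # what is happening in the background.
        if requested == "sdpa" and not all(supported.values()):
            unsupported = [name for name, ok in supported.items() if not ok]
            warnings.warn(
                f"SDPA is not supported by: {', '.join(unsupported)}; "
                "falling back to eager attention for those parts."
            )
        # Activate SDPA only for the parts that support it.
        for name, model in parts.items():
            model.attn_implementation = "sdpa" if (requested == "sdpa" and supported[name]) else "eager"

model = CompositeVLM()
model.set_attn_implementation("sdpa")
print(model.language_model.attn_implementation)  # sdpa
print(model.vision_tower.attn_implementation)    # eager
```

Here the language model gets SDPA while the vision tower keeps eager attention, and a warning tells the user about the split.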

Verified models

NielsRogge commented 5 months ago

Edited your issue to include a list of models to check ;) feel free to expand

lucasjinreal commented 3 months ago

Please consider adding this for specific vision encoders, such as SigLIP, CLIP, InternViT, etc.