Multi-modal model support

Feature request

Expand support for multi-modal models going forward. LLaVA 1.6 is one candidate, but waiting for the next strong model to come out (IDEFICS 2?) would be fine too.

Motivation

In open source, inference support for multi-modal models is much weaker than for LLMs. Fine-tuning multi-modal models is already hard for open source developers, and running inference at even a small production scale is harder still. For example, LLaVA 1.6 is supported by SGLang, which works fine but is more obscure than TGI or vLLM.

Your contribution

If someone has visibility into upcoming model releases, it would be great to coordinate with those teams and try to get one or two models supported. FWIW, the original IDEFICS works with TGI AFAIK, but its performance is outdated.
