vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[New Model]: LLaVA-OneVision #7420

Closed: ethanporcaro closed this issue 1 month ago

ethanporcaro commented 2 months ago

The model to consider.

https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov

There are a bunch of others using the same architecture.

The closest model vllm already supports.

Qwen2. AFAIK the main difference is the vision encoder, which I think is based on SigLIP (also supported).

What's your difficulty of supporting the model you want?

Combining Qwen2 with the SigLIP vision encoder (and possibly other changes).
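As a sanity check on the architecture described above, the checkpoint's config can be inspected without downloading any weights. This is a minimal sketch; the sub-config attribute names (`text_config`, `vision_config`) are an assumption based on other LLaVA-style configs and may differ for the lmms-lab repo:

```python
from transformers import AutoConfig

# Pull only the config (no weights) to see which sub-models the checkpoint
# combines. trust_remote_code may be needed for the lmms-lab repo.
config = AutoConfig.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-7b-ov",
    trust_remote_code=True,
)

# LLaVA-style configs usually carry separate language and vision sub-configs;
# their model_type fields are what vLLM would need to map onto its existing
# Qwen2 and SigLIP implementations. Attribute names are assumptions here.
print(config.model_type)
print(getattr(config, "text_config", None))
print(getattr(config, "vision_config", None))
```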

ywang96 commented 2 months ago

Once we merge the PR to support multi-image/video input, it should be pretty straightforward to add support for this model in vLLM!

DarkLight1337 commented 1 month ago

Video inputs are now supported in vLLM with the addition of #6571, so it should be possible to implement this model now.
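For context, once the model is registered, offline inference would presumably follow vLLM's existing multi-modal prompt format. The snippet below is a hedged sketch rather than a working example: the checkpoint is not yet supported at this point in the thread, and the `<image>` placeholder and prompt template are assumptions borrowed from the existing LLaVA examples. Video inputs added in #6571 would presumably follow the same pattern with a video entry in `multi_modal_data`.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Hypothetical: this checkpoint is not yet registered in vLLM at this point
# in the thread, so this call would fail until support is merged.
llm = LLM(model="lmms-lab/llava-onevision-qwen2-7b-ov", trust_remote_code=True)

image = Image.open("frame.jpg")
outputs = llm.generate(
    {
        # Prompt template is an assumption; each model defines its own
        # image placeholder convention.
        "prompt": "<image>\nWhat is shown in this frame?",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```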

litianjian commented 1 month ago

> Video inputs are now supported in vLLM with the addition of #6571, so it should be possible to implement this model now.

I have implemented LLaVA-OneVision support. Once the benchmark evaluation is done, I will open a PR for this.
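While such a PR is pending, vLLM's out-of-tree registration hook could in principle be used to plug in an implementation like this. A rough sketch, assuming a hypothetical module `llava_onevision` that provides the model class, and assuming the registered architecture name matches the one in the checkpoint's config.json:

```python
from vllm import LLM, ModelRegistry

# Hypothetical module: llava_onevision.py would hold the new vLLM model class
# that wires the Qwen2 language model to the SigLIP vision tower.
from llava_onevision import LlavaOnevisionForConditionalGeneration

# Register the architecture name so vLLM can resolve it before the in-tree
# support lands. The name must match the "architectures" entry in the
# checkpoint's config.json; the value used here is a placeholder.
ModelRegistry.register_model(
    "LlavaOnevisionForConditionalGeneration",
    LlavaOnevisionForConditionalGeneration,
)

llm = LLM(model="lmms-lab/llava-onevision-qwen2-7b-ov", trust_remote_code=True)
```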

salvaba94 commented 1 month ago

I've tried this model with BitsAndBytes 4-bit quantization, and it looks like it is not yet supported in vLLM the way it is in Hugging Face Transformers. Do you also plan to add quantization support for this model?
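For reference, the Transformers path this comment compares against looks roughly like the sketch below; whether `AutoModelForCausalLM` resolves this particular checkpoint may depend on `trust_remote_code` and the LLaVA code being available, so treat it as illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via Transformers' bitsandbytes integration;
# the model id is the checkpoint discussed in this issue.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-7b-ov",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
```

On the vLLM side, the analogous request would presumably be `LLM(..., quantization="bitsandbytes", load_format="bitsandbytes")`, which at the time of this thread appears to cover only a subset of architectures.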