Once we merge the PR to support multi-image/video input, it should be pretty straightforward to add support for this model in vLLM!
Video inputs are now supported in vLLM with the addition of #6571, so it should be possible to implement this model now.
I have implemented llava-ov support. Once the benchmark evaluation is done, I will open a PR for this.
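For reference, here is a rough sketch of what running this model through vLLM's multimodal interface could look like once support lands. The checkpoint id, prompt template, and frame format are my assumptions and may not match the final implementation:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Hypothetical HF-format checkpoint id; the name actually supported may differ.
llm = LLM(model="llava-hf/llava-onevision-qwen2-7b-ov-hf", max_model_len=8192)

# Dummy clip of 8 RGB frames; in practice these would be decoded from a video file.
video = np.random.randint(0, 255, (8, 384, 384, 3), dtype=np.uint8)

# Prompt template assumed to follow the Qwen2 chat format with a <video> placeholder.
prompt = (
    "<|im_start|>user <video>\nWhat is happening in this video?<|im_end|>"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```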
I've tried this model with BitsAndBytes 4-bit quantization, and it looks like it is not yet supported in vLLM the way it is in Hugging Face Transformers. Do you also plan on adding support for quantizing this model?
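For comparison, this is roughly the 4-bit path in Transformers today, assuming a recent Transformers release and the llava-hf converted checkpoint (the exact model id is my assumption):

```python
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaOnevisionForConditionalGeneration,
)

# Standard 4-bit NF4 quantization config used with bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed HF-format checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```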
The model to consider.
https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov
There are a bunch of others using the same architecture.
The closest model vllm already supports.
Qwen2. AFAIK the main difference is the vision encoder, which I think is based on SigLIP (also supported).
What's your difficulty of supporting the model you want?
Mixing Qwen2 and SigLIP (and possibly other changes); a rough sketch of that composition is below.
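To make the composition concrete, here is a minimal, self-contained sketch of the LLaVA-style wiring (SigLIP vision tower → MLP projector → Qwen2 language model) built from tiny randomly initialized configs so it runs without downloading weights. It is purely illustrative: it does not reflect vLLM's actual model interface, and the real model splices image tokens at `<image>` placeholder positions rather than simply prepending them.

```python
import torch
from torch import nn
from transformers import (
    Qwen2Config,
    Qwen2ForCausalLM,
    SiglipVisionConfig,
    SiglipVisionModel,
)


class ToyLlavaOneVision(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny SigLIP vision tower (16 patches for a 64x64 image with 16x16 patches).
        self.vision_tower = SiglipVisionModel(
            SiglipVisionConfig(
                hidden_size=64, intermediate_size=128,
                num_hidden_layers=2, num_attention_heads=4,
                image_size=64, patch_size=16,
            )
        )
        # Tiny Qwen2 language model.
        self.language_model = Qwen2ForCausalLM(
            Qwen2Config(
                hidden_size=128, intermediate_size=256,
                num_hidden_layers=2, num_attention_heads=4,
                num_key_value_heads=4, vocab_size=1000,
            )
        )
        # MLP projector mapping vision features into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 128)
        )

    def forward(self, pixel_values, input_ids):
        # Encode image patches and project them to the LM hidden size.
        patch_feats = self.vision_tower(pixel_values).last_hidden_state
        image_embeds = self.projector(patch_feats)
        # Prepend projected image tokens to the text embeddings (simplified).
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)


model = ToyLlavaOneVision()
out = model(torch.randn(1, 3, 64, 64), torch.randint(0, 1000, (1, 8)))
print(out.logits.shape)  # (1, 16 image tokens + 8 text tokens, 1000)
```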