Hi there @AmazDeng! It looks like this model is already supported on transformers. However, multi-image per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but this is definitely one of the top priorities on our roadmap!
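For reference, here is a minimal sketch of video prompting through the Transformers LlavaNextVideo classes, loosely following the llava-hf/LLaVA-NeXT-Video-7B-hf model card. The model id, prompt template, and dummy frames are illustrative assumptions; the point is that the "video" is just a stack of frames, i.e. the multi-image-per-prompt pattern vLLM does not yet handle.

```python
# Minimal sketch: video prompting with Transformers' LlavaNextVideo classes.
# The video input is simply a stack of frames (multi-image per prompt).
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # example checkpoint
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 8 random frames stand in for a real decoded clip (num_frames, H, W, 3), uint8.
video = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"

inputs = processor(text=prompt, videos=video, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```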
Yes, the latest version of Transformers now supports the llava-next-video model. However, the inference speed is very slow, so I hope you can support this model soon. I also have another question: why does the vLLM framework not support the direct input of inputs_emb so far? If you know, could you please explain the reason?
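To make the question concrete, here is a small sketch of what "direct input of embeddings" means in Transformers terms: compute the token embeddings yourself (e.g. so image features can be spliced in) and pass them to generate() instead of token ids. The model name below is just a small stand-in, not anything specific to this thread.

```python
# Sketch: passing precomputed inputs_embeds to generate() in Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # small example model; any decoder-only LM behaves the same
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

ids = tokenizer("Describe the image:", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)  # (1, seq_len, hidden_size)
# ...a multimodal pipeline would splice projected image features in here...
out = model.generate(inputs_embeds=embeds, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```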
Why does the vLLM framework not support the direct input of inputs_emb so far? If you know, could you please explain the reason?
I do think that's something we should support (and there's indeed an issue for this: https://github.com/vllm-project/vllm/issues/416). This will be another API change, so we need to make sure everything's compatible.

At least as a first step, we do plan to support image embeddings as input (instead of PIL.Image) for vision language models. This will be part of our Q3 roadmap.
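Since this was only a roadmap item at the time, the following is purely a hypothetical sketch of what embeddings-as-input could look like through vLLM's multi_modal_data prompt format. The model id, the saved image_embeds.pt tensor, and the assumption that a precomputed embedding tensor is accepted in place of a PIL.Image are all illustrative, not a description of the current API.

```python
# Hypothetical sketch: feeding precomputed image embeddings to vLLM instead of a PIL.Image.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # example model name
sampling_params = SamplingParams(max_tokens=64)

# Assumed precomputed vision-tower + projector output: (num_image_tokens, hidden_size).
image_embeds = torch.load("image_embeds.pt")

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
        "multi_modal_data": {"image": image_embeds},
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```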
Hi there @AmazDeng! It looks like this model is already supported on
transformers
. However, multi-image per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but this is definitely one of the top priorities on our roadmap!
I am trying to implement Llava-Next-Video support in #6571.
The model to consider.
The llava-next-video project has already been released, and the test results are quite good. Are there any plans to support this project?
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video.md
Currently, Hugging Face does not support this model.

The closest model vllm already supports.
No response
What's your difficulty of supporting the model you want?
No response