vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[New Model]: LLaVA-NeXT-Video support #5124

Open AmazDeng opened 1 month ago

AmazDeng commented 1 month ago

The model to consider.

The llava-next-video project has already been released, and the test results are quite good. Are there any plans to support this model? https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video.md Currently, Hugging Face Transformers does not support this model.

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

No response

ywang96 commented 2 weeks ago

Hi there @AmazDeng! It looks like this model is already supported in transformers. However, multi-image input per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but it is definitely one of the top priorities on our roadmap!
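
For context, video prompting in this style boils down to uniformly sampling a handful of frames and attaching them as multiple images to a single prompt. Below is a minimal frame-sampling sketch using OpenCV; the eight-frame uniform sampling, the `clip.mp4` path, and the idea of later feeding these frames to vLLM are illustrative assumptions, since the multi-image API is not available yet.

```python
import cv2
from PIL import Image


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` frames from a video as RGB PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; convert to RGB before building the PIL image.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames


# Hypothetical usage once multi-image prompts land: pass these frames as the
# image inputs of a single prompt instead of one PIL.Image.
frames = sample_frames("clip.mp4", num_frames=8)
```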

AmazDeng commented 2 weeks ago

Yes, the latest version of Transformers now supports the llava-next-video model, but its inference speed is very slow, so I hope you can support this model soon. I also have another question: why does the vLLM framework still not support passing inputs_emb directly as input? If you know the reason, could you please explain it?
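
For reference, here is a minimal sketch of running the model through Transformers as mentioned above. The llava-hf/LLaVA-NeXT-Video-7B-hf checkpoint, the plain USER/ASSISTANT prompt with the `<video>` placeholder, and the dummy frame array are assumptions based on the Hugging Face model card, not details from this thread.

```python
import numpy as np
import torch
from transformers import (
    LlavaNextVideoForConditionalGeneration,
    LlavaNextVideoProcessor,
)

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed checkpoint name
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Dummy clip: 8 frames of 336x336 RGB. Replace with frames decoded from a real video.
video = np.random.randint(0, 255, (8, 336, 336, 3), dtype=np.uint8)
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"

inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```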

ywang96 commented 2 weeks ago

Why does the vLLM framework still not support passing inputs_emb directly as input? If you know the reason, could you please explain it?

I do think that's something we should support (and there's indeed an issue for this: https://github.com/vllm-project/vllm/issues/416). This will be another API change, so we need to make sure everything stays compatible.

At least as a first step, we do plan to support image embeddings as input (instead of PIL.Image) for vision-language models. This will be part of our Q3 roadmap.
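
To make "embeddings as input" concrete, below is a minimal sketch of precomputing image patch embeddings with the CLIP vision tower that LLaVA-style models use. The openai/clip-vit-large-patch14-336 checkpoint and the penultimate-layer, drop-CLS-token convention are assumptions taken from the usual LLaVA recipe; how vLLM would actually accept such a tensor is exactly the API still to be designed, so no vLLM call is shown.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

tower_id = "openai/clip-vit-large-patch14-336"  # assumed LLaVA-style vision tower
processor = CLIPImageProcessor.from_pretrained(tower_id)
encoder = CLIPVisionModel.from_pretrained(tower_id)

image = Image.new("RGB", (336, 336))  # placeholder; use a real image in practice
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = encoder(pixel_values, output_hidden_states=True)

# LLaVA-style models typically take the penultimate layer and drop the CLS token,
# giving one embedding per image patch.
patch_embeds = outputs.hidden_states[-2][:, 1:]
print(patch_embeds.shape)  # (1, 576, 1024) for a 336x336 input with 14x14 patches
```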

TKONIY commented 2 days ago

Hi there @AmazDeng! It looks like this model is already supported in transformers. However, multi-image input per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but it is definitely one of the top priorities on our roadmap!

I am working on implementing LLaVA-NeXT-Video support in #6571.