vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Model]: Llava-Next-Video support #6571

Open TKONIY opened 1 month ago

TKONIY commented 1 month ago

The model to consider.

LLaVA-NeXT-Video (LlavaNextVideoForConditionalGeneration)

The closest model vllm already supports.

Llava-Next (LlavaNextForConditionalGeneration)

What's your difficulty of supporting the model you want?

[image attachment]
ywang96 commented 1 month ago

Do you plan to make a PR for this? FYI, support for multi-image inputs (which is essentially what video LLaVA is doing) is indeed on our Q3 roadmap, so it would be great if we could collaborate on this effort.
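To illustrate the "video is essentially multi-image" point above, here is a minimal sketch (not vLLM code; the array shapes are illustrative assumptions) showing how a video clip decomposes into a list of per-frame images that a multi-image pipeline could consume:

```python
import numpy as np

# A toy "video" clip: 16 frames of 336x336 RGB (shape chosen arbitrarily
# for illustration; real frame sizes depend on the vision encoder).
video = np.zeros((16, 336, 336, 3), dtype=np.uint8)

# Treating the video as a multi-image input: each frame becomes one image.
frames = [video[i] for i in range(video.shape[0])]

print(len(frames), frames[0].shape)
```

In other words, once an engine supports N images per request, a T-frame video is just the special case N = T, plus video-specific preprocessing (frame sampling, temporal ordering).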

TKONIY commented 1 month ago

Yes, but I haven't finished yet. I'm still working on it.

TKONIY commented 3 weeks ago

I will make a PR this week. It will support a dynamic number of input frames, which is important but not supported by SGLang.