TKONIY opened 1 month ago
Do you plan to make a PR for this? FYI, support for multi-image input (which is essentially what Video-LLaVA does) is indeed on our Q3 roadmap, so it would be great if we could collaborate on the effort.
Yes, but I haven't finished yet; I'm still working on it.
I will make a PR this week. It will support a dynamic number of input frames, which is important but not supported by SGLang.
SGLang requires a `num_frames` parameter when launching a llava-next-video model, and simply asserts that every input video contains exactly `num_frames` frames. If a video has fewer than `num_frames` frames, its embedding is padded with wrong values; if it has more, its embedding is truncated. In both cases the results are incorrect.
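To illustrate the problem, here is a toy sketch (not actual SGLang code; `NUM_FRAMES` and the helper name are hypothetical) of what a fixed-frame-count scheme does to per-frame embeddings:

```python
NUM_FRAMES = 8  # fixed when the server is launched (hypothetical value)

def fixed_length_embedding(frame_embeddings: list) -> list:
    """Force the embedding to exactly NUM_FRAMES entries, regardless of input."""
    if len(frame_embeddings) >= NUM_FRAMES:
        # Videos with extra frames are silently truncated.
        return frame_embeddings[:NUM_FRAMES]
    # Shorter videos are padded, so the model sees meaningless zero vectors.
    padding = [0.0] * (NUM_FRAMES - len(frame_embeddings))
    return frame_embeddings + padding
```

Either branch corrupts the input: the model never sees the video as it actually is.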
Instead, in vLLM, the embedding length will be calculated from each arriving request, so videos with different numbers of frames are supported.
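A minimal sketch of that per-request calculation (the function name and the `tokens_per_frame` default are assumptions for illustration, not vLLM's actual API):

```python
def embedding_length_for(request_frames: list, tokens_per_frame: int = 144) -> int:
    """Derive the number of placeholder embedding tokens from the actual
    frame count of this request, rather than a fixed server-wide setting."""
    return len(request_frames) * tokens_per_frame
```

Because the length is derived from the request itself, no padding or truncation is ever needed.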
The model to consider.
LLaVA-NeXT-Video* (LlavaNextVideoForConditionalGeneration)
The closest model vllm already supports.
Llava-Next (LlavaNextForConditionalGeneration)
What's your difficulty of supporting the model you want?