Does llava-next-video deploy only focus on first frames?

sgl-project / sglang

SGLang is yet another fast serving framework for large language models and vision language models.

Apache License 2.0

3.29k stars 201 forks

Does llava-next-video deploy only focus on first frames? #510

Open LetheRiver0 opened 1 month ago

LetheRiver0 commented 1 month ago

I'm trying to deploy llava-next-video with sglang, and it runs successfully. However, it seems to focus only on the first frame of the input: if I pass in 10 frames and ask the model to describe them, the generation only contains information from the first frame. Does anyone know what is happening? Thanks~ Also, where can I print the input tokens for the model? I want to check whether all frames are actually passed to the model.

AmazDeng commented 1 month ago

I'm having a similar problem.

1. I deployed sglang and loaded the llava-next-image model, but sglang can only handle a single inference. If I run batch inference, for example with batch_size=10, sglang only completes the first 5 requests; the last 5 get stuck and never return a result.
2. I'm trying to load the llava-next-model model for inference, but sglang does not produce a result.

Luodian commented 5 days ago

Indeed, the first version of our code patch has the issues mentioned above. We will send a new PR, along with our new models, to fix them. Sorry to keep you waiting.