haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
20.4k stars 2.26k forks

Multiple Images or Video as Input #1465

Open AliAbdulRehman opened 7 months ago

AliAbdulRehman commented 7 months ago

Question

Can the chatbot take multiple sequential images as input? I'm trying to predict pedestrian trajectories, and my inputs are multiple frames from a video. How can the model understand sequential images?

ElliottDyson commented 6 months ago

It doesn't. Have a look at the Video-LLaVA repo instead.
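
On the input-preparation side of the question, here is a minimal sketch of picking a fixed number of evenly spaced frames from a clip, a common first step before passing frames to a video-capable model. It is not taken from the LLaVA codebase or from the video repo mentioned above. The function name `sample_frame_indices` and its parameters are hypothetical, and the number of frames a given model expects is an assumption to check against that model's own documentation.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return up to num_samples evenly spaced frame indices for a clip.

    Hypothetical helper for illustration; not part of any LLaVA API.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # If the clip is shorter than the requested count, keep every frame.
    if num_samples >= total_frames:
        return list(range(total_frames))
    # Split the clip into num_samples equal-width segments and take
    # the frame at the midpoint of each segment.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # Example: choose 8 frames from a 240-frame clip.
    print(sample_frame_indices(240, 8))
```

The returned indices can then be used to read the matching frames with whatever video library is already in the pipeline. Taking segment midpoints keeps the samples spread over the whole clip rather than clustered at the start.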