haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
18.31k stars 2k forks

[Question] Benchmarking on Video-QA #765

Open Anthony6197 opened 8 months ago

Anthony6197 commented 8 months ago

Question

Dear LLaVA Developer Team,

I must say the LMM is truly brilliant! 😊 I have a question: is LLaVA capable of performing video-QA? In other words, can the model accept a video or a set of sampled frames as input? We are currently working on creating a video-QA benchmark and are exploring the possibility of using LLaVA as one of our baseline models.

Thank you for your assistance in clarifying this.

xmy0916 commented 8 months ago

@Anthony6197 I think the LLaVA code supports a list of images as input, according to this code: https://github.com/haotian-liu/LLaVA/blob/main/llava/model/llava_arch.py#L114. You should only need to modify the data-processing code.
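
A rough sketch of what that data-processing change could look like, in case it helps: sample a fixed number of frames uniformly from the video and put one image placeholder per frame in front of the question. The helper names `sample_frame_indices` and `build_multi_frame_prompt` are hypothetical, and the `<image>` placeholder string is an assumption about how the data pipeline marks image positions, so check it against the repository before relying on it. Decoding the frames themselves (for example with a video reader library) is left out.

```python
from typing import List

# Assumption: the data pipeline replaces each occurrence of this placeholder
# with the features of one image. Verify the exact string in the repository.
IMAGE_TOKEN = "<image>"


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices spread uniformly over a video of num_frames frames."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames)
    # Split the video into equal-length segments and take each segment's midpoint.
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def build_multi_frame_prompt(question: str, num_images: int) -> str:
    """Prefix the question with one image placeholder per sampled frame."""
    placeholders = "\n".join([IMAGE_TOKEN] * num_images)
    return f"{placeholders}\n{question}"


if __name__ == "__main__":
    indices = sample_frame_indices(num_frames=100, num_samples=4)
    print(indices)
    print(build_multi_frame_prompt("What happens in the video?", len(indices)))
```

The sampled indices would then be used to load the frames as a list of images, in the same order as the placeholders in the prompt.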

tingxueronghua commented 8 months ago

But I think the more severe problem is the lack of such multi-image data in the training dataset. By the way, are there any high-quality interleaved multi-image datasets?

xmy0916 commented 8 months ago

@tingxueronghua
Pretraining datasets: webvid-10M (video), blip_laion_cc_sbu_558k (image), Charades_v1 (video).
SFT datasets: refer to videochat, llava, videollama, mplug, llavar.

xmy0916 commented 8 months ago

https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models#awesome-datasets

tingxueronghua commented 8 months ago

@xmy0916 Thanks!

jameszhou-gl commented 8 months ago

Hi @tingxueronghua, have you implemented multi-image input and obtained any concrete results?