[Open] Issue opened by Anthony6197, 8 months ago
@Anthony6197 I think the LLaVA code already supports taking a list of images as input, according to this code: https://github.com/haotian-liu/LLaVA/blob/main/llava/model/llava_arch.py#L114. What you need to do is just modify the data-processing code.
But I think the more severe problem is the lack of such data in the training dataset... By the way, I want to ask: are there any high-quality interleaved multi-image datasets?
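To make the suggestion above concrete, here is a rough sketch of the data-processing change (this is not the official LLaVA API; the `<image>` token string matches `llava/constants.py`, but the helper function and the `model.generate` call shown in the comment are hypothetical). The idea is to emit one image placeholder per frame in the prompt and pass the frames to the model as a list, which the `llava_arch.py` code linked above can already consume:

```python
# Illustrative sketch only: the data-side change needed to feed several
# images (e.g. sampled video frames) to a model whose forward pass already
# accepts a list of image tensors, as llava_arch.py does.
# DEFAULT_IMAGE_TOKEN matches llava/constants.py; the rest is hypothetical.

DEFAULT_IMAGE_TOKEN = "<image>"

def build_multi_image_prompt(question, num_frames):
    """Prepend one image placeholder per frame to the text question."""
    placeholders = "\n".join(DEFAULT_IMAGE_TOKEN for _ in range(num_frames))
    return placeholders + "\n" + question

prompt = build_multi_image_prompt("What happens in this video?", 4)
print(prompt)

# The model side could then receive the frames as a list, e.g. (pseudocode):
# output = model.generate(input_ids, images=[frame1, frame2, frame3, frame4])
```

Whether the model produces sensible multi-image answers is a separate question, since (as noted above) such interleaved data is largely absent from the training set.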
@tingxueronghua Pretrain datasets: WebVid-10M (video), blip_laion_cc_sbu_558k (image), Charades_v1 (video). SFT datasets: refer to VideoChat, LLaVA, Video-LLaMA, mPLUG, LLaVAR.
@xmy0916 Thanks!
Hi @tingxueronghua, have you implemented multi-image input and obtained any concrete results?
Question
Dear LLaVA Developer Team,
I must say the LMM is truly brilliant! 😊 I have a question: is LLaVA capable of performing video-QA? In other words, can the model accept a video or a set of sampled frames as input? We are currently working on creating a video-QA benchmark and are exploring the possibility of using LLaVA as one of our baseline models.
Thank you for your assistance in clarifying this.