mbzuai-oryx / LLaVA-pp

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

Video support? #3

Closed: RaulKite closed this issue 2 months ago

RaulKite commented 2 months ago

Is there any way to use a video as input?

Thanks!

mmaaz60 commented 2 months ago

Hi @RaulKite

I appreciate your interest in our work. The Phi-3-V and LLaMA-3-V models can be used with video inputs. However, the responses may be suboptimal, as the models are fine-tuned on image data only.

You can use a spatio-temporal feature extraction method from Video-ChatGPT to extract video features and feed them to the LLM (instead of image features) for generating responses.
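
For reference, here is a minimal sketch of the Video-ChatGPT-style spatio-temporal pooling, assuming you have already extracted per-frame CLIP patch features; the function name and tensor shapes below are illustrative, not part of this repo's API:

```python
import torch

def spatio_temporal_pool(frame_features: torch.Tensor) -> torch.Tensor:
    """Pool per-frame patch features into a single video representation.

    frame_features: (T, N, D) tensor of CLIP patch embeddings for T sampled
    frames, each with N spatial patches of dimension D.
    Returns: (T + N, D) tensor of temporal tokens concatenated with spatial
    tokens, following the pooling scheme described in Video-ChatGPT.
    """
    # Temporal pooling: average over the T frames -> one token per patch.
    spatial = frame_features.mean(dim=0)   # (N, D)
    # Spatial pooling: average over the N patches -> one token per frame.
    temporal = frame_features.mean(dim=1)  # (T, D)
    # Concatenate along the token axis; these T + N tokens go through the
    # projector to the LLM in place of the usual single-image patch tokens.
    return torch.cat([temporal, spatial], dim=0)
```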

A more appropriate approach would be to fine-tune the Phi-3-V and LLaMA-3-V models on the VideoInstruct100K dataset to obtain more comprehensive video understanding; a sketch of converting that data into LLaVA's training format follows below.
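
As a starting point, one would need the video instruction pairs in the conversation-style JSON that LLaVA's training scripts consume. This is a hypothetical converter; the input field names and the use of a `"video"` key (which would require a video-aware dataloader) are assumptions, so please check the actual VideoInstruct100K schema before using it:

```python
import json

def to_llava_record(sample_id, video_path, question, answer):
    """Build a LLaVA-style conversation record from one video
    instruction pair (field names here are illustrative)."""
    return {
        "id": sample_id,
        "video": video_path,  # consumed by a custom video-aware dataloader
        "conversations": [
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }

# Example with placeholder data: write records out in the JSON format
# that LLaVA-style training scripts expect.
records = [to_llava_record("0001", "videos/0001.mp4",
                           "What is happening in the video?",
                           "A person is cooking pasta in a kitchen.")]
with open("videoinstruct_llava.json", "w") as f:
    json.dump(records, f, indent=2)
```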

I hope this helps. Please let us know if you have any questions. Thank you!