RaulKite: Is there any way to use a video as input? Thanks!
Hi @RaulKite
I appreciate your interest in our work. The Phi-3-V and LLaMA-3-V models can accept video inputs; however, the responses may not be good, since the models are fine-tuned on image data only.
You can use the spatio-temporal feature-extraction method from Video-ChatGPT to extract video features and feed them to the LLM (instead of image features) to generate responses, as sketched below.
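For illustration, here is a minimal PyTorch sketch of Video-ChatGPT-style spatio-temporal pooling. It assumes per-frame patch features have already been extracted with a vision encoder such as CLIP; the function name, tensor shapes, and dimensions are illustrative assumptions, not the exact Video-ChatGPT implementation.

```python
# Minimal sketch of Video-ChatGPT-style spatio-temporal pooling.
# Assumes per-frame patch features were already extracted with a vision
# encoder (e.g., CLIP); names and shapes here are illustrative.
import torch

def spatio_temporal_features(frame_features: torch.Tensor) -> torch.Tensor:
    """Pool per-frame patch features into video-level features.

    frame_features: (T, P, D) -- T frames, P patches per frame, D channels.
    Returns: (T + P, D) -- temporal tokens followed by spatial tokens.
    """
    # Temporal features: average over patches, one token per frame -> (T, D).
    temporal = frame_features.mean(dim=1)
    # Spatial features: average over frames, one token per patch -> (P, D).
    spatial = frame_features.mean(dim=0)
    # Concatenate; these tokens would replace image features as LLM input.
    return torch.cat([temporal, spatial], dim=0)

# Example: 100 frames, 256 patches, 1024-dim features -> (356, 1024) tokens.
video_tokens = spatio_temporal_features(torch.randn(100, 256, 1024))
print(video_tokens.shape)  # torch.Size([356, 1024])
```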
A more appropriate approach would be to fine-tune the Phi-3-V and LLaMA-3-V models on the VideoInstruct100K dataset for more comprehensive video understanding.
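As a starting point, you could inspect the dataset with the Hugging Face `datasets` library. Note that the Hub id `MBZUAI/VideoInstruct-100K` below is an assumption on my part; please check the Video-ChatGPT repository for the authoritative download location.

```python
# Hedged sketch: load VideoInstruct100K for fine-tuning. The Hub id below
# is an assumption -- see the Video-ChatGPT repo for the official source.
from datasets import load_dataset

ds = load_dataset("MBZUAI/VideoInstruct-100K", split="train")
print(len(ds))  # number of instruction/answer records
print(ds[0])    # inspect one record's fields
```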
I hope this helps. Please let us know if you have any questions. Thank you!