RaulKite: Is there any way to use a video as input? Thanks!
Hi @RaulKite
I appreciate your interest in our work. The Phi-3-V and LLaMA-3-V models can accept video inputs; however, the responses may not be good, since the models are fine-tuned on image data only.
You can use the spatio-temporal feature-extraction method from Video-ChatGPT to extract video features and feed them to the LLM (instead of image features) to generate responses, as sketched below.
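For illustration, here is a minimal PyTorch sketch of Video-ChatGPT-style spatio-temporal pooling. It assumes per-frame patch features have already been extracted with a vision encoder such as CLIP; the function name, tensor shapes, and dimensions are illustrative assumptions, not the exact Video-ChatGPT implementation.

```python
# Minimal sketch of Video-ChatGPT-style spatio-temporal pooling.
# Assumes per-frame patch features were already extracted with a vision
# encoder (e.g., CLIP); names and shapes here are illustrative.
import torch

def spatio_temporal_features(frame_features: torch.Tensor) -> torch.Tensor:
    """Pool per-frame patch features into video-level features.

    frame_features: (T, P, D) -- T frames, P patches per frame, D channels.
    Returns: (T + P, D) -- temporal tokens followed by spatial tokens.
    """
    # Temporal features: average over patches, one token per frame -> (T, D).
    temporal = frame_features.mean(dim=1)
    # Spatial features: average over frames, one token per patch -> (P, D).
    spatial = frame_features.mean(dim=0)
    # Concatenate; these tokens would replace image features as LLM input.
    return torch.cat([temporal, spatial], dim=0)

# Example: 100 frames, 256 patches, 1024-dim features -> (356, 1024) tokens.
video_tokens = spatio_temporal_features(torch.randn(100, 256, 1024))
print(video_tokens.shape)  # torch.Size([356, 1024])
```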
A more appropriate approach would be to fine-tune the Phi-3-V and LLaMA-3-V models on the VideoInstruct100K dataset for more comprehensive video understanding.
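As a starting point, you could inspect the dataset with the Hugging Face `datasets` library. Note that the Hub id `MBZUAI/VideoInstruct-100K` below is an assumption on my part; please check the Video-ChatGPT repository for the authoritative download location.

```python
# Hedged sketch: load VideoInstruct100K for fine-tuning. The Hub id below
# is an assumption -- see the Video-ChatGPT repo for the official source.
from datasets import load_dataset

ds = load_dataset("MBZUAI/VideoInstruct-100K", split="train")
print(len(ds))  # number of instruction/answer records
print(ds[0])    # inspect one record's fields
```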
I hope this helps. Please let us know if you have any questions. Thank you!