mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

from video_chatgpt.model.video_chatgpt import VideoChatGPTLlamaForCausalLM #12

Closed ChethanN01 closed 1 year ago

ChethanN01 commented 1 year ago

```
!python scripts/apply_delta.py --base-model-path /content/drive/MyDrive/Video-ChatGPT/video_chatgpt-7B.bin --target-model-path LLaVA-Lightning-7B-v1-1 --delta-path liuhaotian/LLaVA-Lightning-7B-delta-v1-1
Traceback (most recent call last):
  File "/content/drive/MyDrive/Video-ChatGPT/scripts/apply_delta.py", line 10, in <module>
    from video_chatgpt.model.video_chatgpt import VideoChatGPTLlamaForCausalLM
ModuleNotFoundError: No module named 'video_chatgpt'
```

mmaaz60 commented 1 year ago

Hi @ChethanN01,

Thank you for your interest in our work. Please note that you need to add the repository's root directory to the PYTHONPATH in order to run the Video-ChatGPT scripts.

Please run the following command from the root directory of the repository.

```shell
export PYTHONPATH="./:$PYTHONPATH"
```
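Alternatively, if you are working in a Colab notebook, you can add the repository root to `sys.path` before importing. This is a minimal sketch (not part of the repo's scripts); it reuses the Colab path from the traceback above, so adjust it to wherever you cloned the repository:

```python
import sys

# Root of the cloned repository (the Colab path from the traceback above).
REPO_ROOT = "/content/drive/MyDrive/Video-ChatGPT"

# Prepend the repo root so `video_chatgpt` becomes importable.
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

# With the root on sys.path, this import should now resolve
# (requires the actual repository checkout, so it is left commented here):
# from video_chatgpt.model.video_chatgpt import VideoChatGPTLlamaForCausalLM
```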

Please let me know if this solves the issue. Thank you.

ChethanN01 commented 1 year ago

Thanks. Could you please tell me how many hours of video your model can analyze?

hanoonaR commented 1 year ago

Hi @ChethanN01 ,

Thanks for your interest in Video-ChatGPT and for bringing up this question.

Currently, the model has been primarily trained on the ActivityNet dataset, which comprises videos with an average duration of around 2 minutes. The model takes 100 uniformly sampled frames from each video during both training and inference.
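Uniform sampling of a fixed frame budget can be sketched as follows. This is an illustration rather than Video-ChatGPT's actual preprocessing code, and `uniform_frame_indices` is a hypothetical helper name:

```python
import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int = 100) -> np.ndarray:
    """Return `num_samples` frame indices spread uniformly across the video."""
    if total_frames <= num_samples:
        # Video shorter than the budget: keep every frame.
        return np.arange(total_frames)
    # Evenly spaced positions from the first to the last frame, rounded to ints.
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)
```

For a typical ActivityNet clip (~2 minutes at 30 fps, i.e. ~3600 frames), this keeps roughly one frame every 1.2 seconds; for a 1-hour video (~108,000 frames), the same 100-frame budget keeps only about one frame every 36 seconds, which illustrates why representativeness drops on much longer videos.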

While we haven't extensively tested the model on significantly longer videos, the current sampling method should technically still apply. However, as video length increases, the representativeness of the sampled frames diminishes, because we are still sampling only 100 frames.

In essence, the model is capable of processing longer videos, but the effectiveness may not be the same as for the shorter videos it was trained on. We encourage you to experiment with longer videos and share your findings.

Thank you.