mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International
chatbot clip gpt-4 llama llava mulit-modal vicuna video-chatboat video-conversation vision-language vision-language-pretraining

Oryx Video-ChatGPT :movie_camera: :speech_balloon:

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [ACL 2024 🔥]

Muhammad Maaz*, Hanoona Rasheed*, Salman Khan and Fahad Khan

* Equally contributing first authors

Mohamed bin Zayed University of Artificial Intelligence



:loudspeaker: Latest Updates


Online Demo :computer:

:fire::fire: You can try our demo using the provided examples or by uploading your own videos HERE. :fire::fire:

:fire::fire: Or click the image to try the demo! :fire::fire:

All of the videos shown in the demo are available here.


Video-ChatGPT Overview :bulb:

Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.
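The overview does not spell out how per-frame features become a spatiotemporal representation. One simple way to do this, sketched below, is to average-pool the frame features along the patch axis (one temporal token per frame) and along the frame axis (one spatial token per patch), then concatenate both sets. The function names and the toy shapes are illustrative assumptions, and the released model may differ in its details.

```python
def mean_vector(vectors):
    """Element-wise mean of a sequence of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def spatiotemporal_tokens(frames):
    """Pool per-frame patch features into temporal and spatial tokens.

    frames[t][p] is the feature vector of patch p in frame t
    (T frames, N patches per frame, D values per vector).
    Returns T temporal tokens followed by N spatial tokens.
    """
    # Average over the patches of each frame: one token per frame.
    temporal = [mean_vector(patches) for patches in frames]
    # Average each patch position over all frames: one token per patch.
    spatial = [mean_vector(same_patch) for same_patch in zip(*frames)]
    return temporal + spatial


if __name__ == "__main__":
    # Toy input: T=2 frames, N=3 patches, D=2 features (hypothetical values).
    toy = [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
        [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
    ]
    tokens = spatiotemporal_tokens(toy)
    print(len(tokens), "tokens of dimension", len(tokens[0]))
```

The appeal of this kind of pooling is that the token count grows as T + N rather than T × N, which keeps the sequence handed to the language model short.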

Video-ChatGPT Architectural Overview


Contributions :trophy:

Contributions


Installation :wrench:

We recommend setting up a conda environment for the project:

conda create --name=video_chatgpt python=3.10
conda activate video_chatgpt

git clone https://github.com/mbzuai-oryx/Video-ChatGPT.git
cd Video-ChatGPT
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"

Additionally, install FlashAttention for training:

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install
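After the steps above, a quick sanity check can confirm that the environment looks usable before starting a long training run. The sketch below is not part of the repository. It only checks the Python version and whether a few modules can be located. The module list is an assumption (`torch` is presumed to come from requirements.txt, and `flash_attn` is the import name of FlashAttention), so adjust it as needed.

```python
import importlib.util
import sys

# Hypothetical list of modules to probe; adjust it to match requirements.txt.
MODULES = ("torch", "flash_attn")


def check_environment(modules=MODULES, min_python=(3, 10)):
    """Report whether the interpreter version and the given modules look usable."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

A missing `flash_attn` only matters for training, since the instructions above list FlashAttention as a training dependency.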

Running Demo Offline :cd:

To run the demo offline, please refer to the instructions in offline_demo.md.


Training :train:

For training instructions, check out train_video_chatgpt.md.


Video Instruction Dataset :open_file_folder:

We are releasing the dataset of 100,000 high-quality video instruction pairs that was used to train our Video-ChatGPT model. You can download the dataset from here. More details on our human-assisted and semi-automatic annotation framework for generating the data are available in VideoInstructionDataset.md.
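Once downloaded, the instruction data can be inspected with a few lines of Python. The sketch below assumes the data is a JSON list of records with `video_id`, `q`, and `a` fields. Those field names and the helper names are assumptions for illustration, so check VideoInstructionDataset.md for the actual schema.

```python
import json
from collections import Counter


def load_instructions(path):
    """Load a JSON file that is assumed to hold a list of instruction records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_instructions(records):
    """Count question-answer pairs overall and per video."""
    per_video = Counter(record["video_id"] for record in records)
    return {
        "num_pairs": len(records),
        "num_videos": len(per_video),
        "max_pairs_per_video": max(per_video.values(), default=0),
    }


if __name__ == "__main__":
    # Hypothetical records, standing in for the downloaded file.
    sample = [
        {"video_id": "v_001", "q": "What is happening?", "a": "A man is cooking."},
        {"video_id": "v_001", "q": "Where is he?", "a": "In a kitchen."},
        {"video_id": "v_002", "q": "What sport is shown?", "a": "Tennis."},
    ]
    print(summarize_instructions(sample))
```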


Quantitative Evaluation :bar_chart:

Our paper introduces a new Quantitative Evaluation Framework for Video-based Conversational Models. To explore our benchmarks and understand the framework in greater detail, please visit our dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT.

For detailed instructions on performing quantitative evaluation, please refer to QuantitativeEvaluation.md.

Video-based Generative Performance Benchmarking and Zero-Shot Question-Answer Evaluation tables are provided for a detailed performance overview.

Zero-Shot Question-Answer Evaluation

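| Model | MSVD-QA Accuracy | MSVD-QA Score | MSRVTT-QA Accuracy | MSRVTT-QA Score | TGIF-QA Accuracy | TGIF-QA Score | ActivityNet-QA Accuracy | ActivityNet-QA Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FrozenBiLM | 32.2 | - | 16.8 | - | 41.0 | - | 24.7 | - |
| Video Chat | 56.3 | 2.8 | 45.0 | 2.5 | 34.4 | 2.3 | 26.5 | 2.2 |
| LLaMA Adapter | 54.9 | 3.1 | 43.8 | 2.7 | - | - | 34.2 | 2.7 |
| Video LLaMA | 51.6 | 2.5 | 29.6 | 1.8 | - | - | 12.4 | 1.1 |
| Video-ChatGPT | 64.9 | 3.3 | 49.3 | 2.8 | 51.4 | 3.0 | 35.2 | 2.7 |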
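The table reports an accuracy and a score for each dataset. As a rough sketch of how such a pair can be aggregated, assume every predicted answer has already been judged with a yes/no verdict and a numeric score. The field names `pred` and `score` and the function name are assumptions here, and QuantitativeEvaluation.md describes the actual procedure.

```python
def aggregate_judgments(judgments):
    """Turn per-sample verdicts into (accuracy in percent, average score).

    Each judgment is assumed to look like {"pred": "yes" or "no", "score": number}.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    # A sample counts as correct when its verdict is "yes".
    correct = sum(1 for j in judgments if j["pred"].strip().lower() == "yes")
    accuracy = 100.0 * correct / len(judgments)
    average_score = sum(j["score"] for j in judgments) / len(judgments)
    return accuracy, average_score


if __name__ == "__main__":
    # Hypothetical judgments for four question-answer samples.
    judged = [
        {"pred": "yes", "score": 4},
        {"pred": "no", "score": 2},
        {"pred": "yes", "score": 5},
        {"pred": "yes", "score": 3},
    ]
    accuracy, average_score = aggregate_judgments(judged)
    print(f"accuracy={accuracy:.1f} score={average_score:.2f}")
```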

Video-based Generative Performance Benchmarking

| Evaluation Aspect | Video Chat | LLaMA Adapter | Video LLaMA | Video-ChatGPT |
| --- | --- | --- | --- | --- |
| Correctness of Information | 2.23 | 2.03 | 1.96 | 2.40 |
| Detail Orientation | 2.50 | 2.32 | 2.18 | 2.52 |
| Contextual Understanding | 2.53 | 2.30 | 2.16 | 2.62 |
| Temporal Understanding | 1.94 | 1.98 | 1.82 | 1.98 |
| Consistency | 2.24 | 2.15 | 1.79 | 2.37 |
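To compare models with a single number, the five aspect scores above can be averaged per model. The unweighted mean used below is an illustrative choice and not a metric defined by the benchmark. The values are copied from the table.

```python
# Aspect order: correctness, detail, context, temporal, consistency.
ASPECT_SCORES = {
    "Video Chat": [2.23, 2.50, 2.53, 1.94, 2.24],
    "LLaMA Adapter": [2.03, 2.32, 2.30, 1.98, 2.15],
    "Video LLaMA": [1.96, 2.18, 2.16, 1.82, 1.79],
    "Video-ChatGPT": [2.40, 2.52, 2.62, 1.98, 2.37],
}


def mean_scores(scores):
    """Unweighted mean of the aspect scores for each model."""
    return {model: sum(values) / len(values) for model, values in scores.items()}


def best_model(scores):
    """Name of the model with the highest mean aspect score."""
    means = mean_scores(scores)
    return max(means, key=means.get)


if __name__ == "__main__":
    for model, value in mean_scores(ASPECT_SCORES).items():
        print(f"{model}: {value:.3f}")
    print("highest mean:", best_model(ASPECT_SCORES))
```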

Qualitative Analysis :mag:

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

Video Reasoning Tasks :movie_camera:

sample1


Creative and Generative Tasks :paintbrush:

sample5


Spatial Understanding :globe_with_meridians:

sample8


Video Understanding and Conversational Tasks :speech_balloon:

sample10


Action Recognition :runner:

sample22


Question Answering Tasks :question:

sample14


Temporal Understanding :hourglass_flowing_sand:

sample18


Acknowledgements :pray:

If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:

@inproceedings{Maaz2023VideoChatGPT,
    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
    year={2024}
}

License :scroll:

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! :star2: Please raise any issues or questions here.