showlab / videollm-online

VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Apache License 2.0

TLDR

The first streaming video LLM, running at high speed (5-10 FPS on an NVIDIA 3090 GPU, 10-15 FPS on an A100 GPU) on long-form videos (10 minutes), with SOTA performance in both online and offline settings.

Introduction

This is the official implementation of VideoLLM-online: Online Video Large Language Model for Streaming Video, CVPR 2024. Our paper introduces several new ideas compared with popular image/video/multimodal models.

Quick Start

If you run into problems with flash-attn, use the SDPA attention implementation instead:

python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus --attn_implementation sdpa

By passing --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus, the PEFT checkpoint will be automatically downloaded and applied to meta-llama/Meta-Llama-3-8B-Instruct.
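
The fallback described above can be automated. The following is a minimal sketch, not part of the repository: it checks whether the `flash_attn` package is importable and, if not, adds `--attn_implementation sdpa` to the demo command. The helper names `flash_attn_available` and `build_demo_command` are hypothetical; only the module name `demo.app`, the checkpoint name, and the two flags come from the command above. It also assumes that omitting `--attn_implementation` leaves the demo on its default attention backend.

```python
import importlib.util
import sys
from typing import List

CHECKPOINT = "chenjoya/videollm-online-8b-v1plus"


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be found by the import system."""
    return importlib.util.find_spec("flash_attn") is not None


def build_demo_command(use_sdpa: bool) -> List[str]:
    """Build the demo launch command, optionally forcing SDPA attention."""
    cmd = [sys.executable, "-m", "demo.app", "--resume_from_checkpoint", CHECKPOINT]
    if use_sdpa:
        # Same flag and value as in the fallback command above.
        cmd += ["--attn_implementation", "sdpa"]
    return cmd


if __name__ == "__main__":
    command = build_demo_command(use_sdpa=not flash_attn_available())
    # Print rather than run, so the command can be inspected first.
    print(" ".join(command))
```

Note that this only detects whether flash-attn is installed. A build that is installed but fails at runtime still requires passing `--attn_implementation sdpa` by hand.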

Installation

Ensure you have Miniconda and Python version >= 3.10 installed, then run:

conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation

Installing PyTorch from this source also installs ffmpeg, but it is an old version that usually produces very low-quality preprocessing. Please install the newest ffmpeg as follows:

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-*-amd64-static ffmpeg
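
To confirm that the newly downloaded build is the one you will be using, you can query its version. The snippet below is a small sketch under the assumption that the static build was moved to `./ffmpeg` as in the commands above, so the binary sits at `ffmpeg/ffmpeg`. The function names are hypothetical, and how the preprocessing scripts locate ffmpeg is not covered here.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

# Assumption: the static build directory was renamed to ./ffmpeg (see above).
FFMPEG_BIN = Path("ffmpeg") / "ffmpeg"


def parse_ffmpeg_version(banner: str) -> Optional[str]:
    """Extract the version token from the first line of `ffmpeg -version` output."""
    match = re.match(r"ffmpeg version (\S+)", banner)
    return match.group(1) if match else None


def installed_ffmpeg_version(binary: Path = FFMPEG_BIN) -> Optional[str]:
    """Run the binary and return its version string, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [str(binary), "-version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # Binary missing, not executable, or exited with an error.
        return None
    return parse_ffmpeg_version(result.stdout)


if __name__ == "__main__":
    print(installed_ffmpeg_version())
```

If this prints `None`, the binary was not found at the assumed path or could not be executed.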

If you want to try our model with real-time streaming audio, please also clone ChatTTS:

pip install omegaconf vocos vector_quantize_pytorch cython
git clone https://github.com/2noise/ChatTTS
mv ChatTTS demo/rendering/

Training and Evaluation

Model Zoo

VideoLLM-online-8B-v1+

VideoLLM-online-8B-v1

VideoLLM-online beyond Llama

The codebase is deliberately simple and clean. To obtain a Mistral version of VideoLLM-online, you only need to change the inherited class from Llama to Mistral. Please refer to the examples in models/live_llama.
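
To illustrate the idea without depending on the actual model code, here is a minimal sketch using stand-in classes. `LlamaBackbone`, `MistralBackbone`, `LiveMixin`, `LiveLlama`, and `LiveMistral` are hypothetical placeholders, not names taken from the repository; in practice the base classes would be the real Llama and Mistral model classes, and the actual implementation is the one in models/live_llama.

```python
from typing import List


class LlamaBackbone:
    """Stand-in for a Llama language model class."""
    name = "llama"

    def forward(self, tokens: List[str]) -> str:
        return f"{self.name} processed {len(tokens)} tokens"


class MistralBackbone:
    """Stand-in for a Mistral language model class with the same interface."""
    name = "mistral"

    def forward(self, tokens: List[str]) -> str:
        return f"{self.name} processed {len(tokens)} tokens"


class LiveMixin:
    """Backbone-agnostic streaming logic (toy version): prepend frame tokens to text."""

    def stream(self, frames: List[str], text: List[str]) -> str:
        tokens = [f"<frame:{f}>" for f in frames] + text
        # Delegates to whichever backbone the subclass inherits from.
        return self.forward(tokens)


class LiveLlama(LiveMixin, LlamaBackbone):
    pass


# Switching to Mistral only changes the inherited backbone class.
class LiveMistral(LiveMixin, MistralBackbone):
    pass
```

Because the streaming logic is kept separate from the backbone, the two variants differ only in their list of base classes.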

Citation

@inproceedings{videollm-online,
  author       = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
  title        = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
  booktitle    = {CVPR},
  year         = {2024},
}