showlab/VideoLISA

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

[Zechen Bai](https://www.baizechen.site/) ¹ [Tong He](https://hetong007.github.io/) ² [Haiyang Mei](https://mhaiyang.github.io/) ¹ [Pichao Wang](https://wangpichao.github.io/) ² [Ziteng Gao](https://sebgao.github.io/) ¹ [Joya Chen](https://chenjoya.github.io/) ¹ [Lei Liu](https://openreview.net/profile?id=~liulei2) ² [Zheng Zhang](https://scholar.google.com/citations?user=k0KiE4wAAAAJ&hl=en) ² [Mike Zheng Shou](https://sites.google.com/view/showlab) ¹

NeurIPS 2024

¹ [Show Lab, National University of Singapore](https://sites.google.com/view/showlab/home?authuser=0) ² Amazon

[![arXiv](https://img.shields.io/badge/arXiv-<2409.19603>-.svg)](https://arxiv.org/abs/2409.19603)

News

[2024-11-27] We released the ReasonVOS benchmark!
[2024-11-26] We released the pre-trained VideoLISA-3.8B on HuggingFace!
[2024-11-20] We released the training and inference code.
[2024-09-29] We released our paper on arXiv.

TODO

[X] Release the inference code.
[X] Release the training code.
[ ] Instructions on supporting more datasets.

Setup Environment

conda create -n videolisa python=3.10 -y
conda activate videolisa
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn --no-build-isolation

Prepare Data

First, please prepare the image data following the instructions in LISA.

We introduce the video datasets used in this project below. Note that the data paths for the video datasets are currently hard-coded in each dataset file in the utils folder, so you may need to adjust them accordingly.

ReasonVOS

Please refer to BENCHMARK.md

MeViS

Download the dataset from the official release, then extract and organize the files. We expect the following directory structure:

mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── <video #1  >
│   │   ├── <video #2  >
│   │   └── <video #...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── <video ...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
└── valid                       // Split Val
    ├── JPEGImages
    │   └── <video ...>
    │
    └── meta_expressions.json
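
Since the data paths are hard-coded, a quick layout check before training can save a failed run. The following is a minimal sketch, not part of the repository: the helper name `find_missing_entries` and the `MEVIS_LAYOUT` table are hypothetical and simply mirror the tree above.

```python
from pathlib import Path

# Expected entries per split, copied from the directory tree above.
# This table is an illustration, not something the repository ships.
MEVIS_LAYOUT = {
    "train": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid_u": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid": ["JPEGImages", "meta_expressions.json"],
}


def find_missing_entries(root, layout=MEVIS_LAYOUT):
    """Return the relative paths listed in `layout` that are absent under `root`."""
    root = Path(root)
    missing = []
    for split, entries in layout.items():
        for entry in entries:
            if not (root / split / entry).exists():
                missing.append(f"{split}/{entry}")
    return missing
```

For example, `find_missing_entries("/path/to/mevis")` returns an empty list when every expected folder and JSON file is present, and otherwise lists what is missing (such as `valid_u/mask_dict.json`). It only checks that the entries exist; it does not validate the contents of the videos or annotations.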

Ref-YouTube-VOS and Ref-DAVIS-17

Prepare Ref-YouTube-VOS and Ref-DAVIS-17 datasets following the instructions of ReferFormer.

YouTube-VOS

Download the dataset from the website and organize it as follows:

YTVOS
├── train
│   ├── JPEGImages
│   ├── Annotations
│   ├── meta.json

Training

We provide a sample training script in run_train.sh. In our experiments, we train the model on 8 nodes (64 A10 24G GPUs in total). Under this setting, we set batch_size=2 and grad_accumulation_steps=1, so the global effective batch size is batch_size*grad_accumulation_steps*num_gpus=128. You can modify these settings to fit your hardware; note, however, that we did not explore other training hyper-parameters. If you don't have enough GPUs, don't give up: you can still try training the model with a smaller batch size. One tip: if you use a smaller batch size, reducing the learning rate as well might help.
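
The batch-size arithmetic above can be sketched as follows. This is an illustration only: the function names are hypothetical, the `base_lr` value is a placeholder rather than the learning rate used in run_train.sh, and the linear scaling rule is a common heuristic that was not validated in our experiments.

```python
def effective_batch_size(batch_size, grad_accumulation_steps, num_gpus):
    """Global effective batch size, as defined in the paragraph above."""
    return batch_size * grad_accumulation_steps * num_gpus


def scaled_learning_rate(base_lr, base_global_batch, new_global_batch):
    """Linear scaling heuristic: shrink the learning rate with the batch size."""
    return base_lr * new_global_batch / base_global_batch


# Reference setting from our experiments: 64 GPUs, batch_size=2, no accumulation.
reference = effective_batch_size(2, 1, 64)  # 128

# With 8 GPUs, gradient accumulation can recover the same global batch size.
same_global = effective_batch_size(2, 8, 8)  # 128

# Alternatively, accept a smaller global batch and lower the learning rate.
smaller = effective_batch_size(2, 1, 8)  # 16
base_lr = 1e-4  # placeholder value, NOT taken from run_train.sh
suggested_lr = scaled_learning_rate(base_lr, reference, smaller)
```

Raising grad_accumulation_steps keeps the global batch size at 128 at the cost of longer iterations, while the second option trains faster per step but changes the optimization setting, so treat the scaled learning rate as a starting point to tune.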

After training finishes, convert the DeepSpeed checkpoint into a full model weight file:

cd ./runs/video-lisa-3.8b-3k-iter/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

Weight merging

Since the script performs LoRA training with DeepSpeed by default, you need to merge the LoRA weights back into the base model after training.

CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="MBZUAI/LLaVA-Phi-3-mini-4k-instruct" \
  --weight="runs/video-lisa-3.8b-3k-iter/pytorch_model.bin" \
  --save_path="runs/video-lisa-3.8b-3k-iter/merged"

Evaluation

MeViS

Before running the following commands, check the scripts involved and configure the data paths.

# Step 1
bash evaluation/mevis_val_u/run_inference_mevis.sh

# Step 2
bash evaluation/mevis_val_u/run_eval_mevis.sh

Other Datasets

Ongoing.

Citation

To cite the paper and model, please use the following BibTeX entry:

@article{bai2024videolisa,
  title={One token to seg them all: Language instructed reasoning segmentation in videos},
  author={Bai, Zechen and He, Tong and Mei, Haiyang and Wang, Pichao and Gao, Ziteng and Chen, Joya and Liu, Lei and Zhang, Zheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2409.19603},
  year={2024}
}

Acknowledgments

This work is heavily based on LISA, LLaVA, LLaVA-pp, Segment-Anything and Phi-3. Thanks to all the authors for their great work!