klauscc / VindLU

MIT License

VindLU

VindLU: A Recipe for Effective Video-and-Language Pretraining [arXiv] [project page]

Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

Official PyTorch code for VindLU, a recipe for effective Video-and-Language (VidL) Pretraining.

News:

Highlights:

Results

Text-to-Video Retrieval (R@1 accuracy).

| Pretrained Data | MSR-VTT | DiDeMo | ANet | SSv2-Label | SSv2-Template | Checkpoints |
|---|---|---|---|---|---|---|
| 5M | 43.8 | 54.6 | 51.1 | 51.2 | 82.2 | model |
| 17M | 45.3 | 59.2 | 54.4 | 53.0 | 86.2 | model |
| 25M | 46.5 | 61.2 | 55.0 | 53.1 | 83.3 | model |

Video Question Answering (Top-1 accuracy).

| Pretrained Data | ANet-QA | MSRVTT-QA | MSRVTT-MC | TVQA | Checkpoints |
|---|---|---|---|---|---|
| 5M | 44.2 | 43.6 | 95.2 | 79.0 | model |
| 17M | 44.6 | 43.8 | 96.7 | 78.8 | model |
| 25M | 44.7 | 44.6 | 97.1 | 79.0 | model |
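
For reference, R@1 in the retrieval table is the fraction of text queries whose top-ranked video is the ground-truth match (reported as a percentage). The following is a minimal sketch in plain Python, assuming a text-by-video similarity matrix in which query i matches video i. The function name and the toy scores are illustrative and are not taken from the VindLU code.

```python
def recall_at_k(sim, k=1):
    """Fraction of text queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity between text i and video j; the ground-truth
    video for text i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Indices of the k highest-scoring videos for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)


# Toy example: 3 text queries scored against 3 videos.
toy_sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r_at_1 = recall_at_k(toy_sim, k=1)  # 2 of 3 queries are hits
```

Multiplying the result by 100 gives a number on the same scale as the table entries.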

Setup

The specific packages used in our experiments are listed in vl.yml. You can create a conda env containing these packages as follows.

# create 
conda env create -f vl.yml
# activate
conda activate vl

In your ~/.bashrc file, set the environment variables:

export VL_EXP_DIR="/path/to/ckpts_and_logs"
export VL_DATA_DIR="/path/to/data"

The datasets are stored under $VL_DATA_DIR and experiment outputs are stored under $VL_EXP_DIR. These variables are accessed by the config files in the configs/ directory.
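
As a rough illustration of how a config file might resolve these variables, the sketch below reads them from the environment. The helper name resolve_dirs, the error message, and the derived anno_root path are assumptions made for illustration; the files in configs/ may do this differently.

```python
import os


def resolve_dirs(env=None):
    """Return (data_dir, exp_dir) read from VL_DATA_DIR and VL_EXP_DIR."""
    env = os.environ if env is None else env
    missing = [k for k in ("VL_DATA_DIR", "VL_EXP_DIR") if k not in env]
    if missing:
        raise KeyError(f"Please export {', '.join(missing)} in ~/.bashrc")
    return env["VL_DATA_DIR"], env["VL_EXP_DIR"]


# An explicit mapping is passed here so the snippet runs without editing ~/.bashrc.
data_dir, exp_dir = resolve_dirs(
    {"VL_DATA_DIR": "/path/to/data", "VL_EXP_DIR": "/path/to/ckpts_and_logs"}
)
# Hypothetical derived path: the pretraining annotations folder under the data dir.
anno_root = os.path.join(data_dir, "anno_pretrain")
```

Calling resolve_dirs() with no argument reads the real environment and fails early if either variable is unset.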

[Optional] Our codebase supports using wandb to monitor training. If you want to use wandb, set it up following this very short instruction, and set wandb.enable to True in the configs.

Data

Organize your data in the following structure:

$VL_DATA_DIR
    |-- anno_pretrain     
        |-- webvid_train.sqlite.db
        |-- ...
    |-- anno_downstream
        |-- didemo_ret_train.json
        |-- ...
    |-- videos_images
        |-- webvid_2fps_224
            |-- 1053400385.mp4
            |-- ...
        |-- ...

Our prepared annotations are available on Google Drive.

Refer to DATA.md for how to prepare the image/video datasets.

The annotation file is in json format and can be loaded as a list of dictionaries. Each dictionary is {'image': path_to_image, 'caption': image_caption} for an image-text dataset and {'image': path_to_video, 'caption': video_caption} for a video-text dataset. Note that, for simplicity, both image-text and video-text datasets use the same key, image.
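
The format described above can be illustrated with a small round trip through json. The file name, media paths, and captions below are placeholders invented for this example; whether real paths are absolute or relative to $VL_DATA_DIR depends on your setup.

```python
import json
import os
import tempfile

# Hypothetical entries; real paths and captions come from your own dataset.
annotations = [
    {"image": "webvid_2fps_224/1053400385.mp4", "caption": "a placeholder video caption"},
    {"image": "some_image_dir/000001.jpg", "caption": "a placeholder image caption"},
]

# Write the list of dictionaries to disk and read it back.
anno_path = os.path.join(tempfile.mkdtemp(), "toy_anno.json")
with open(anno_path, "w") as f:
    json.dump(annotations, f)

with open(anno_path) as f:
    loaded = json.load(f)

# Video and image entries both expose the media path under the same "image" key.
media_paths = [entry["image"] for entry in loaded]
```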

We store the pretraining annotation files in SQLite, a file-based database. SQLite lets us load captions on demand, which saves a large amount of CPU memory. With the json format, the Dataloader would need more than 200GB of CPU memory for 8 GPUs and 3 workers per GPU process, because each of the 24 workers keeps its own copy of the json files in memory, and the files are large (~5GB, and even larger once loaded as Python objects).

You can use create_sqlite_db.py to convert the json annotation files into SQLite files.
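
To show why a file-based database avoids the per-worker memory cost, here is a minimal sketch using Python's built-in sqlite3 module. The table name annos, its two columns, and both helper functions are assumptions for illustration; they are not necessarily the schema or interface that create_sqlite_db.py produces.

```python
import json
import os
import sqlite3
import tempfile


def json_to_sqlite(annotations, db_path):
    """Store each annotation as one JSON-encoded row keyed by its list index."""
    con = sqlite3.connect(db_path)
    with con:  # commits the transaction on success
        con.execute("CREATE TABLE annos (id INTEGER PRIMARY KEY, data TEXT)")
        con.executemany(
            "INSERT INTO annos (id, data) VALUES (?, ?)",
            [(i, json.dumps(a)) for i, a in enumerate(annotations)],
        )
    con.close()


def load_annotation(db_path, index):
    """Fetch one annotation from disk instead of holding the whole list in RAM."""
    con = sqlite3.connect(db_path)
    row = con.execute("SELECT data FROM annos WHERE id = ?", (index,)).fetchone()
    con.close()
    if row is None:
        raise IndexError(index)
    return json.loads(row[0])


# Placeholder annotations in the {'image': ..., 'caption': ...} format.
annotations = [
    {"image": "video_0.mp4", "caption": "placeholder caption 0"},
    {"image": "video_1.mp4", "caption": "placeholder caption 1"},
]
db_path = os.path.join(tempfile.mkdtemp(), "toy_anno.sqlite.db")
json_to_sqlite(annotations, db_path)
second = load_annotation(db_path, 1)
```

With this layout, each dataloader worker only reads the rows it needs, so memory use no longer scales with the number of workers times the size of the annotation file.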

Training and Inference

All the tasks can be launched via the python script tools/run.py.

The script uses slurm if the sbatch command exists. You can force a local run by adding the argument --no_slurm.

Without slurm, you need to launch the training script on each node yourself.
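
The sbatch check described above could look roughly like the sketch below. This is one plausible way to express the rule, not the code in tools/run.py; the function name and the no_slurm parameter are assumptions that mirror the --no_slurm argument.

```python
import shutil


def should_use_slurm(no_slurm=False):
    """Use slurm only when sbatch is on PATH and the user has not opted out."""
    return (not no_slurm) and shutil.which("sbatch") is not None


use_slurm = should_use_slurm()                  # depends on the machine
forced_local = should_use_slurm(no_slurm=True)  # always False
```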

Usage:

python tools/run.py --slurm_args SLURM_ARGS --jobname JOBNAME \
    --dep_jobname DEP_JOBNAME \
    --nnodes NNODES --ngpus NGPUS --task TASK \
    --config CONFIG_FILE --model_args MODEL_ARGS
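
In the examples in this README, MODEL_ARGS is a quoted string of space-separated key/value pairs, where a dotted key such as criterion.loss_weight.vtc addresses a nested config entry. The sketch below shows one plausible way such overrides could be applied to a nested dictionary. The function apply_overrides, the use of ast.literal_eval for value parsing, and the default config are assumptions for illustration; the actual parsing in this codebase may differ.

```python
import ast
import shlex


def apply_overrides(config, model_args):
    """Apply 'key value key value ...' overrides to a nested dict in place."""
    tokens = shlex.split(model_args)
    if len(tokens) % 2 != 0:
        raise ValueError("model_args must contain key/value pairs")
    for key, raw in zip(tokens[0::2], tokens[1::2]):
        try:
            value = ast.literal_eval(raw)  # numbers, booleans, lists
        except (ValueError, SyntaxError):
            value = raw  # fall back to a plain string
        # Walk down the dotted key, creating intermediate dicts as needed.
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# Hypothetical default config; the real defaults live in the configs/ directory.
config = {"train_corpus": "placeholder", "criterion": {"loss_weight": {"vtc": 0.5}}}
apply_overrides(config, "train_corpus webvid_cc3m criterion.loss_weight.vtc 1.0")
```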

Pre-Training

Example for pretraining on webvid_cc3m (5M):

corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
    --jobname $pt_name \
    --config configs/pretrain.py \
    --model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0"

You can use this script in two cases: 1) with slurm, or 2) without slurm when only one node is used.

If using slurm, remember to add --slurm_args SLURM_ARGS according to your cluster's settings. The same applies to the examples below.

You can change corpus to "webvid_14m" for the 17M corpus and "webvid10m_14m" for the 25M corpus.

See the variable available_corpus in configs/data.py for all supported pretraining corpora. You can add your own datasets by adding them to available_corpus.

Multi-node pretrain without slurm

The following example pretrains on 2 nodes with 4 GPUs per node without slurm.

When running locally without slurm, you need


#### Finetuning and Evaluation

The following examples build on the pretrained model from the section above.

##### Text-to-video retrieval

Supported datasets: `msrvtt`, `msrvtt-9k`, `didemo`, `anet`.
Example for `msrvtt` dataset:
``` bash
dataset=msrvtt
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_${dataset}

if [[ "$dataset" == *"msrvtt"* ]]; then ngpus=4; else ngpus=1; fi
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"
```

##### Video Question Answering

``` bash
dataset=msrvtt # supported: msrvtt, anet
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-qa_${dataset}

ngpus=1
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"
```

MSRVTT-MC, evaluated with the finetuned `msrvtt` retrieval model:

``` bash
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_msrvtt
nfrm_test=12  # set explicitly so the job name below is well defined

# evaluation
python tools/run.py --nnodes 1 --ngpus 1 --task retrieval_mc \
    --jobname ${ft_name}/eval_${nfrm_test}frm-mc --dep_jobname ${ft_name} \
    --config configs/ret_msrvtt_mc.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"
```

Acknowledgement

This code uses resources from Singularity, transformers, ALBEF, ClipBERT, and frozen. The code is implemented using PyTorch. We thank the authors for open-sourcing their awesome projects.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{cheng2022vindlu,
  title={VindLU: A Recipe for Effective Video-and-Language Pretraining},
  author={Cheng, Feng and Wang, Xizi and Lei, Jie and Crandall, David and Bansal, Mohit and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2212.05051},
  year={2022}
}