layer6ai-labs / xpool

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Satya Krishna Gorti*, Noël Vouitsis*, Junwei Ma*, Keyvan Golestan, Maksims Volkovs, Animesh Garg, Guangwei Yu

[Paper](https://arxiv.org/abs/2203.15086) | [Project Page & Demo](https://layer6ai-labs.github.io/xpool/)

Introduction

This repository contains the official implementation of our CVPR 2022 paper. It includes both training and evaluation code.

Dependencies

Our model was developed and evaluated using the following package dependencies:

  • PyTorch 1.8.1
  • Transformers 4.6.1
  • OpenCV 4.5.3
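
To compare a local environment against the versions listed above, a small check along these lines may help. It is only a sketch: the distribution names `torch` and `opencv-python` are assumptions about how PyTorch and OpenCV were installed, and `check_versions` is a hypothetical helper that is not part of this repository.

```python
from importlib import metadata

# Versions listed in this README. The distribution names are assumptions:
# PyTorch is usually installed as "torch" and OpenCV as "opencv-python".
EXPECTED = {
    "torch": "1.8.1",
    "transformers": "4.6.1",
    "opencv-python": "4.5.3",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed or None, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Simple prefix comparison, so builds such as "1.8.1+cu111" still match.
        matches = installed is not None and installed.startswith(wanted)
        report[name] = (wanted, installed, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions().items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {wanted}, found {installed} ({status})")
```

Other versions may well work; the list above only records what the model was developed and evaluated with.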

Datasets

We trained models on the MSR-VTT, MSVD and LSMDC datasets. To download the datasets, refer to this repository.

For LSMDC, you must obtain permission from MPII to download and use the data, so we do not provide the split and caption files in the data/ directory.

Evaluation

The following commands reproduce the main results of our paper using the supplied checkpoint files for each dataset. By default, the commands report text-to-video retrieval (t2v) results. For video-to-text retrieval (v2t) results, add the argument --metric=v2t to the command.

If the outputs/ folder does not exist, first run mkdir outputs to create the directory. For each dataset, create a directory in outputs/ and store the corresponding checkpoint file. For each command below, replace {exp_name} with the name of that directory.

Also, replace {videos_dir} with the path to the dataset's videos.
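
The directory layout described above can be prepared with a short script such as the following. This is a sketch under assumptions: `prepare_experiment` is a hypothetical helper, and the checkpoint path passed to it stands for wherever you saved the downloaded file. The file name is kept unchanged on purpose.

```python
from pathlib import Path
import shutil


def prepare_experiment(exp_name, checkpoint_path, outputs_root="outputs"):
    """Create outputs/<exp_name>/ and copy a downloaded checkpoint into it."""
    exp_dir = Path(outputs_root) / exp_name
    # Creates outputs/ as well if it does not exist yet (like `mkdir -p`).
    exp_dir.mkdir(parents=True, exist_ok=True)
    # Keep the downloaded file name as-is inside the experiment directory.
    target = exp_dir / Path(checkpoint_path).name
    if not target.exists():
        shutil.copy2(checkpoint_path, target)
    return target
```

The value passed as `exp_name` here is the one to substitute for {exp_name} in the commands.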

For evaluation, you can change the batch_size without affecting results.
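
For readers unfamiliar with the reported metric, R@1 is the percentage of queries whose correct match is ranked first. The toy example below illustrates the usual definition for both retrieval directions. It is not the evaluation code of this repository: it assumes exactly one matching video per text, and the similarity values are made up.

```python
def recall_at_k(sims, k=1):
    """Percentage of queries whose matching item is ranked in the top k.

    sims[i][j] is the similarity between query i and candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sims):
        # Zero-based rank = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sims)


def transpose(sims):
    """Swap the roles of queries and candidates."""
    return [list(col) for col in zip(*sims)]


# Toy 3x3 text-by-video similarity matrix (made-up numbers).
sims = [
    [0.5, 0.9, 0.1],
    [0.2, 0.8, 0.0],
    [0.1, 0.7, 0.3],
]
t2v_r1 = recall_at_k(sims, k=1)             # texts query videos
v2t_r1 = recall_at_k(transpose(sims), k=1)  # videos query texts
print(f"t2v R@1 = {t2v_r1:.1f}, v2t R@1 = {v2t_r1:.1f}")
```

In this toy matrix the two directions give different scores, which is why t2v and v2t results are reported separately.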

| Dataset | Command | Checkpoint File | t2v R@1 Result |
| --- | --- | --- | --- |
| MSR-VTT-9k | `python test.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --huggingface --load_epoch=-1 --dataset_name=MSRVTT --msrvtt_train_file=9k` | Link | 46.9 |
| MSR-VTT-7k | `python test.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --huggingface --load_epoch=-1 --dataset_name=MSRVTT --msrvtt_train_file=7k` | Link | 43.9 |
| MSVD | `python test.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --huggingface --load_epoch=-1 --dataset_name=MSVD` | Link | 47.2 |
| LSMDC | `python test.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --huggingface --load_epoch=-1 --dataset_name=LSMDC` | Link | 25.2 |
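
When evaluating several datasets, or both retrieval directions, it can be convenient to assemble the evaluation commands programmatically. The sketch below uses only the flags shown in this section; `build_test_command` and the example paths are hypothetical and not part of the repository.

```python
import shlex

# Dataset-specific flags copied from the evaluation commands in this section.
EVAL_FLAGS = {
    "MSR-VTT-9k": ["--dataset_name=MSRVTT", "--msrvtt_train_file=9k"],
    "MSR-VTT-7k": ["--dataset_name=MSRVTT", "--msrvtt_train_file=7k"],
    "MSVD": ["--dataset_name=MSVD"],
    "LSMDC": ["--dataset_name=LSMDC"],
}


def build_test_command(dataset, exp_name, videos_dir, v2t=False, batch_size=32):
    """Return the test.py invocation for one dataset as an argument list."""
    cmd = [
        "python", "test.py",
        f"--exp_name={exp_name}",
        f"--videos_dir={videos_dir}",
        f"--batch_size={batch_size}",
        "--huggingface",
        "--load_epoch=-1",
    ]
    cmd += EVAL_FLAGS[dataset]
    if v2t:
        # Switch from the default t2v metric to video-to-text retrieval.
        cmd.append("--metric=v2t")
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders; print the command instead of running it.
    print(shlex.join(build_test_command("MSVD", "msvd", "/data/msvd", v2t=True)))
```

The returned list can be passed directly to `subprocess.run` from the repository root once the checkpoint and videos are in place.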

Training

The following commands train our X-Pool model on each dataset. As with the evaluation commands, results are reported for text-to-video retrieval (t2v) by default. For video-to-text retrieval (v2t) results, add the argument --metric=v2t to the command.

For each command below, replace {exp_name} with a name of your choice for the experiment. Also, replace {videos_dir} with the path to the dataset's videos.

| Dataset | Command |
| --- | --- |
| MSR-VTT-9k | `python train.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --noclip_lr=3e-5 --transformer_dropout=0.3 --huggingface --dataset_name=MSRVTT --msrvtt_train_file=9k` |
| MSR-VTT-7k | `python train.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --noclip_lr=1e-5 --transformer_dropout=0.4 --huggingface --dataset_name=MSRVTT --msrvtt_train_file=7k` |
| MSVD | `python train.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --noclip_lr=1e-5 --transformer_dropout=0.4 --huggingface --dataset_name=MSVD` |
| LSMDC | `python train.py --exp_name={exp_name} --videos_dir={videos_dir} --batch_size=32 --noclip_lr=1e-5 --transformer_dropout=0.3 --huggingface --dataset_name=LSMDC` |
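
The training commands differ only in the learning rate passed as --noclip_lr, the dropout passed as --transformer_dropout, and the dataset flags. One way to keep these settings in a single place is sketched below; `TRAIN_SETTINGS` and `build_train_command` are hypothetical names, and the values are copied from the commands above.

```python
# Per-dataset settings copied from the training commands in this section.
TRAIN_SETTINGS = {
    "MSR-VTT-9k": {"noclip_lr": "3e-5", "transformer_dropout": "0.3",
                   "dataset": ["--dataset_name=MSRVTT", "--msrvtt_train_file=9k"]},
    "MSR-VTT-7k": {"noclip_lr": "1e-5", "transformer_dropout": "0.4",
                   "dataset": ["--dataset_name=MSRVTT", "--msrvtt_train_file=7k"]},
    "MSVD": {"noclip_lr": "1e-5", "transformer_dropout": "0.4",
             "dataset": ["--dataset_name=MSVD"]},
    "LSMDC": {"noclip_lr": "1e-5", "transformer_dropout": "0.3",
              "dataset": ["--dataset_name=LSMDC"]},
}


def build_train_command(dataset, exp_name, videos_dir):
    """Return the train.py invocation for one dataset as an argument list."""
    settings = TRAIN_SETTINGS[dataset]
    return [
        "python", "train.py",
        f"--exp_name={exp_name}",
        f"--videos_dir={videos_dir}",
        "--batch_size=32",
        f"--noclip_lr={settings['noclip_lr']}",
        f"--transformer_dropout={settings['transformer_dropout']}",
        "--huggingface",
        *settings["dataset"],
    ]


if __name__ == "__main__":
    # Placeholder experiment name and path; prints one command per dataset.
    for name in TRAIN_SETTINGS:
        print(" ".join(build_train_command(name, "my_experiment", "/path/to/videos")))
```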

Citation

If you find this work useful in your research, please cite the following paper:

@inproceedings{gorti2022xpool,
  title={X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval},
  author={Gorti, Satya Krishna and Vouitsis, No{\"e}l and Ma, Junwei and Golestan, Keyvan and Volkovs, Maksims and Garg, Animesh and Yu, Guangwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}