ninatu / everything_at_once

Official implementation of "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval". CVPR 2022

Everything at Once – Multi-modal Fusion Transformer for Video Retrieval

Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R., Harwath, D., Glass, J. and Kuehne, H. Everything at Once – Multi-modal Fusion Transformer for Video Retrieval. In CVPR, 2022.

arXiv preprint arXiv:2112.04446


Accepted at CVPR 2022!

Repository contains:

Get started

  1. Create an environment:

    conda create python=3.6 -y -n everything_at_once
    conda activate everything_at_once 
    conda install -y pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
    pip install gensim==3.8.0 sacred==0.8.2 humanize==3.14.0 transformers==4.10.2 librosa==0.8.1 timm==0.4.12
    pip install neptune-contrib==0.28.1 --ignore-installed certifi
  2. If needed, download data.tar with features and spectrograms for fine-tuning and evaluation on YouCook2 and MSR-VTT here. Extract the archive: tar -xvf data.tar

  3. If needed, create a pretrained_models folder and download the model weights here:

    Extract the archive:

    cd pretrained_models
    tar -xvf everything_at_once_tva.tar
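The extraction steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `extract_archive` is hypothetical and not part of this repository, and the checkpoint path in the commented usage is inferred from the commands in this README.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract a tar archive into target_dir and return its member names.

    Hypothetical helper, not part of the repository.
    Only use it on archives you trust: extractall() writes whatever
    paths the archive contains.
    """
    archive_path = Path(archive_path)
    if not archive_path.is_file():
        raise FileNotFoundError(f"Archive not found: {archive_path}")

    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        tar.extractall(path=target_dir)
    return names


# Example usage (paths taken from the steps above):
# extract_archive("pretrained_models/everything_at_once_tva.tar", "pretrained_models")
# Path("pretrained_models/everything_at_once_tva/latest_model.pth").is_file()
```

Checking that `latest_model.pth` exists after extraction catches a wrong working directory before an evaluation or fine-tuning run is started.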


To evaluate a pretrained everything-at-once model on the MSR-VTT dataset, run:

python --n_gpu 1  \
  --config configs/evaluation/msrvtt_at_once.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

On the YouCook2 dataset:

python --n_gpu 1  \
  --config configs/evaluation/youcook_at_once.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

See the configs/evaluation folder for more configs for evaluating models trained with S3D or CLIP features, or for using other strategies to process long videos.


To fine-tune the HowTo100M-pretrained model on the MSR-VTT dataset, run:

python \
  --config configs/finetuning/finetune_msrvtt.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

Add the --neptune flag if you want to log experiments using Neptune (see Experiment Logging).

On the YouCook2 dataset:

python \
  --config configs/finetuning/finetune_youcook.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

Add the --neptune flag if you want to log experiments using Neptune (see Experiment Logging).

See the configs/finetuning/clip folder for configs for fine-tuning with CLIP features.


  1. Download HowTo100M and extract features. Please note that HowTo100M videos require a huge amount of storage, and the features alone take up terabytes of space. Feature extraction (ResNet-152, ResNeXt-101) and audio spectrogram extraction are described in detail separately. We will release the code for S3D and CLIP feature extraction.

  2. Review configs/pretraining/everything_at_once_tva.yaml and make sure csv, features_path, features_path_audio, and caption_path point to the correct paths. The CSV file should contain one column named 'path' with a list of videos. An example of the CSV file that we used in training can be found here (HowTo100M_1166_videopaths.txt).

  3. Train: python --config configs/pretraining/everything_at_once_tva.yaml

Add the --neptune flag if you want to log experiments using Neptune (see Experiment Logging).

See the configs/pretraining folder for more configs for the different ablation experiments.
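Step 2 of the pretraining instructions can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the repository's own tooling: the helper names (`write_video_list`, `read_video_list`, `missing_paths`) and the example video paths are hypothetical, and because the nesting of the YAML config is not described here, the check searches a loaded config dictionary recursively for the four keys named above.

```python
import csv
from pathlib import Path

# Key names taken from the pretraining instructions above.
PATH_KEYS = ("csv", "features_path", "features_path_audio", "caption_path")


def write_video_list(video_paths, csv_path):
    """Write a CSV with a single 'path' column, one video per row."""
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path"])  # header expected by the instructions
        for video_path in video_paths:
            writer.writerow([str(video_path)])
    return csv_path


def read_video_list(csv_path):
    """Read the CSV back and verify it has exactly one column named 'path'."""
    with Path(csv_path).open(newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["path"]:
            raise ValueError(f"Expected a single 'path' column, got {reader.fieldnames}")
        return [row["path"] for row in reader]


def collect_path_entries(config, keys=PATH_KEYS):
    """Recursively collect (key, value) pairs for the given keys.

    Assumption: the config is a nested structure of dicts and lists,
    as produced by a typical YAML loader.
    """
    found = []
    if isinstance(config, dict):
        for key, value in config.items():
            if key in keys and isinstance(value, str):
                found.append((key, value))
            else:
                found.extend(collect_path_entries(value, keys))
    elif isinstance(config, list):
        for item in config:
            found.extend(collect_path_entries(item, keys))
    return found


def missing_paths(config, keys=PATH_KEYS):
    """Return the (key, path) entries whose path does not exist on disk."""
    return [(k, v) for k, v in collect_path_entries(config, keys)
            if not Path(v).exists()]
```

The config dictionary could be obtained with a YAML loader such as PyYAML's `yaml.safe_load` (assuming PyYAML is installed); an empty result from `missing_paths` means every listed path was found.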

Experiment Logging

This repository uses Sacred with Neptune for logging and tracking experiments. If you want to activate this:

  1. Create a Neptune account.
  2. Create a project, copy in your credentials (api_token, project_name) in
  3. Add the --neptune flag to the training command (e.g. python --neptune ..)

Using the model on your own data

If you want to use the model on your own data, please follow the steps for feature extraction and audio spectrogram extraction.

You may also take a look at everything_at_once_tva.yaml, where some comments about how to define n_video_tokens and num_audio_STFT_frames are provided.


If you use this code in your research, please cite:

  title={Everything at Once-Multi-Modal Fusion Transformer for Video Retrieval},
  author={Shvetsova, Nina and Chen, Brian and Rouditchenko, Andrew and Thomas, Samuel and Kingsbury, Brian and Feris, Rogerio S and Harwath, David and Glass, James and Kuehne, Hilde},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},


If you have any problems with the code or have a question, please open an issue or send an email to shvetsova at I'll try to answer as soon as possible.

Acknowledgments and Licenses

The main structure of the code is based on the frozen-in-time code, which itself is based on the pytorch-template. Thanks for sharing good practices!

Parts of the code are derived from other projects and are licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).