arXiv preprint arXiv:2112.04446
Accepted at CVPR 2022!
Repository contains:
Create an environment:
conda create python=3.6 -y -n everything_at_once
conda activate everything_at_once
conda install -y pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
pip install gensim==3.8.0 sacred==0.8.2 humanize==3.14.0 transformers==4.10.2 librosa==0.8.1 timm==0.4.12
pip install neptune-contrib==0.28.1 --ignore-installed certifi
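After installation, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch and not part of the repository: the helper names `installed_versions` and `find_mismatches` are hypothetical, and the pins are copied from the pip commands above. It falls back to `pkg_resources` because `importlib.metadata` is not available on Python 3.6.

```python
# Hypothetical helper, not part of the repository.
# Pins copied from the pip install commands above.
PINNED = {
    "gensim": "3.8.0",
    "sacred": "0.8.2",
    "humanize": "3.14.0",
    "transformers": "4.10.2",
    "librosa": "0.8.1",
    "timm": "0.4.12",
    "neptune-contrib": "0.28.1",
}


def installed_versions(names):
    """Return {name: version or None} for the given distribution names."""
    try:
        from importlib import metadata  # Python 3.8+

        def lookup(name):
            try:
                return metadata.version(name)
            except metadata.PackageNotFoundError:
                return None
    except ImportError:
        import pkg_resources  # shipped with setuptools, works on Python 3.6

        def lookup(name):
            try:
                return pkg_resources.get_distribution(name).version
            except pkg_resources.DistributionNotFound:
                return None

    return {name: lookup(name) for name in names}


def find_mismatches(pinned, installed):
    """Return {name: (expected, actual)} for packages that differ from the pin."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


if __name__ == "__main__":
    mismatches = find_mismatches(PINNED, installed_versions(PINNED))
    for name, (expected, actual) in mismatches.items():
        print(f"{name}: expected {expected}, found {actual or 'not installed'}")
```

Running the script inside the activated `everything_at_once` environment prints one line per package that is missing or has a different version, and nothing if all pins match.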
If needed, download data.tar with features and spectrograms for fine-tuning and evaluation on YouCook2 and MSR-VTT here. Extract the archive:
tar -xvf data.tar
If needed, create a pretrained_models folder and download the model weights here:
Extract the archive:
cd pretrained_models
tar -xvf everything_at_once_tva.tar
To evaluate a pretrained everything-at-once model on the MSR-VTT dataset, run:
python test.py --n_gpu 1 \
--config configs/evaluation/msrvtt_at_once.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
On the YouCook2 dataset:
python test.py --n_gpu 1 \
--config configs/evaluation/youcook_at_once.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
Check out the configs/evaluation folder for more configs for evaluating models trained with S3D or CLIP features, or for using other strategies to process long videos.
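To evaluate the same checkpoint with several configs in a row, the commands above can be generated programmatically. This is a minimal sketch that assumes only the `test.py` flags shown above (`--n_gpu`, `--config`, `--resume`); the names `build_eval_command` and `run_all` are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Checkpoint and configs taken from the evaluation commands above.
CHECKPOINT = "pretrained_models/everything_at_once_tva/latest_model.pth"
CONFIGS = [
    "configs/evaluation/msrvtt_at_once.yaml",
    "configs/evaluation/youcook_at_once.yaml",
]


def build_eval_command(config, checkpoint=CHECKPOINT, n_gpu=1):
    """Assemble the argument list for one test.py run."""
    return [
        sys.executable, "test.py",
        "--n_gpu", str(n_gpu),
        "--config", config,
        "--resume", checkpoint,
    ]


def run_all(configs=CONFIGS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for config in configs:
        command = build_eval_command(config)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop at the first failing evaluation.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to inspect them before starting the actual evaluations from the repository root.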
To fine-tune the HowTo100M-pretrained model on the MSR-VTT dataset, run:
python train.py \
--config configs/finetuning/finetune_msrvtt.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
Add the --neptune flag if you want to log experiments using neptune.ai (see Experiment Logging).
On the YouCook2 dataset:
python train.py \
--config configs/finetuning/finetune_youcook.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
Add the --neptune flag if you want to log experiments using neptune.ai (see Experiment Logging).
Check out the configs/finetuning/clip folder for configs for fine-tuning with CLIP features.
Downloading HowTo100M and feature extraction. Please note that the HowTo100M videos require a huge amount of storage, and the features alone take up terabytes of space. Feature extraction (ResNet-152, ResNeXt-101) and audio spectrogram extraction are described in detail in https://github.com/roudimit/AVLnet/blob/main/training.md. We will release the code for S3D and CLIP feature extraction.
Review configs/pretraining/everything_at_once_tva.yaml and make sure that csv, features_path, features_path_audio, and caption_path point to the correct paths.
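The path check can also be automated. The sketch below is not part of the repository and makes two assumptions: the config has already been loaded into a Python dict (for example with `yaml.safe_load` from PyYAML), and the four keys may appear at any nesting depth, since the exact layout of the YAML file is not described here. The function names `collect_paths` and `missing_paths` are hypothetical.

```python
import os

# Config keys named in the text above.
PATH_KEYS = ("csv", "features_path", "features_path_audio", "caption_path")


def collect_paths(node, keys=PATH_KEYS):
    """Recursively yield (key, value) pairs for string values stored under the given keys."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in keys and isinstance(value, str):
                yield key, value
            else:
                yield from collect_paths(value, keys)
    elif isinstance(node, list):
        for item in node:
            yield from collect_paths(item, keys)


def missing_paths(config):
    """Return the (key, path) pairs whose path does not exist on disk."""
    return [(key, path) for key, path in collect_paths(config) if not os.path.exists(path)]


# Example usage (assumes PyYAML is installed):
# import yaml
# with open("configs/pretraining/everything_at_once_tva.yaml") as f:
#     config = yaml.safe_load(f)
# for key, path in missing_paths(config):
#     print(f"{key} points to a missing path: {path}")
```

An empty result means every path found under those keys exists; it does not verify that the files have the expected content.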
The CSV file should contain one column named 'path' with a list of videos. An example of the CSV file that we used for training can be found here (HowTo100M_1166_videopaths.txt).
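As an illustration of the expected layout, the following sketch writes and reads a CSV file with a single 'path' column using Python's standard `csv` module. The function names and the video paths are placeholders, and whether the paths should be absolute, relative, or include a file extension is an assumption that should be checked against the example file mentioned above.

```python
import csv


def write_video_list(video_paths, csv_path):
    """Write a CSV with a single 'path' column, one video per row."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path"])  # header expected by the description above
        for video_path in video_paths:
            writer.writerow([video_path])


def read_video_list(csv_path):
    """Read the 'path' column back as a list of strings."""
    with open(csv_path, newline="") as f:
        return [row["path"] for row in csv.DictReader(f)]


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    write_video_list(["videos/example_0001.mp4", "videos/example_0002.mp4"], "my_videos.csv")
    print(read_video_list("my_videos.csv"))
```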
Train:
python train.py --config configs/pretraining/everything_at_once_tva.yaml
Add the --neptune flag if you want to log experiments using neptune.ai (see Experiment Logging).
Check out the configs/pretraining folder for more configs for different ablation experiments.
This repository uses Sacred together with neptune.ai for logging and tracking experiments. To activate this, configure neptune.ai in train.py and add the --neptune key to the training command (e.g. python train.py --neptune ..).
If you want to use the model on your own data, please follow the steps described in https://github.com/roudimit/AVLnet for feature extraction and audio spectrogram extraction.
You may also take a look at everything_at_once_tva.yaml, which contains comments on how to define n_video_tokens and num_audio_STFT_frames.
If you use this code in your research, please cite:
@inproceedings{shvetsova2022everything,
title={Everything at Once-Multi-Modal Fusion Transformer for Video Retrieval},
author={Shvetsova, Nina and Chen, Brian and Rouditchenko, Andrew and Thomas, Samuel and Kingsbury, Brian and Feris, Rogerio S and Harwath, David and Glass, James and Kuehne, Hilde},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={20020--20029},
year={2022}
}
If you have any problems with the code or have a question, please open an issue or send an email to shvetsova at em.uni-frankfurt.de. I'll try to answer as soon as possible.
The main structure of the code is based on the frozen-in-time code: https://github.com/m-bain/frozen-in-time, which itself is based on the pytorch-template https://github.com/victoresque/pytorch-template. Thanks for sharing good practices!
The code in davenet.py, layers.py, and avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).