Existing temporal action detection (TAD) methods rely on large training sets with segment-level annotations and can recognize only previously seen classes at inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) addresses this obstacle by enabling a pre-trained model to recognize unseen action classes. ZS-TAD is, however, much more challenging and far less investigated. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP-style classification. Because localization (e.g., proposal generation) and classification are performed sequentially, this design is prone to localization error propagation. To overcome this problem, we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). This design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that STALE significantly outperforms state-of-the-art alternatives. Our model also yields superior results on supervised TAD over recent strong competitors.
We suggest creating a Conda environment and installing the requirements:
pip3 install -r requirements.txt
We use the MaskFormer implementation for representation masking.
git clone https://github.com/sauradip/STALE.git
cd STALE
git clone https://github.com/facebookresearch/MaskFormer
Follow the MaskFormer installation instructions to install Detectron2 and the other modules, within this same environment if possible. After this step, place the files from /STALE/extra_files into /STALE/MaskFormer/mask_former/modeling/transformer/.
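The copy step can also be scripted. The following is a minimal sketch, assuming it is run from the STALE repository root after MaskFormer has been cloned there; `copy_extra_files` is a hypothetical helper name, not part of the repository.

```python
import shutil
from pathlib import Path


def copy_extra_files(src_dir, dst_dir):
    """Copy every regular file in src_dir into dst_dir; return the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not dst.is_dir():
        # The target only exists once MaskFormer has been cloned inside STALE.
        raise FileNotFoundError(f"target directory not found: {dst}")
    copied = []
    for item in sorted(src.iterdir()):
        if item.is_file():  # subdirectories are skipped
            shutil.copy2(item, dst / item.name)  # overwrites a same-named file
            copied.append(item.name)
    return copied


# Example call (paths relative to the STALE repository root):
# copy_extra_files("extra_files", "MaskFormer/mask_former/modeling/transformer")
```

The helper fails early when the target directory is missing, which usually means the MaskFormer clone step was skipped or run from a different directory.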
Download the video features and update the video paths and output paths in the config/anet.yaml file. For now, only the ActivityNet v1.3 dataset config is available. We plan to release the code for the THUMOS14 dataset soon.
Dataset | Feature Backbone | Pre-Training | Link |
---|---|---|---|
ActivityNet | ViT-B/16-CLIP | CLIP | Google Drive |
THUMOS | ViT-B/16-CLIP | CLIP | Google Drive |
ActivityNet | I3D | Kinetics-400 | Google Drive |
THUMOS | I3D | Kinetics-400 | Google Drive |
Currently we support the training splits provided by the EfficientPrompt paper. Both the 50% and 75% labeled-data splits are available for training; they can be found in STALE/splits.
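To make the split percentages concrete, the sketch below divides a list of class names into seen and unseen subsets. It is purely illustrative: the files in STALE/splits are the ones to use for training, and this is not how they were generated. `split_classes` and the placeholder class labels are hypothetical.

```python
import random


def split_classes(class_names, seen_fraction, seed=0):
    """Return (seen, unseen) class lists; seen gets round(seen_fraction * N) classes."""
    if not 0.0 < seen_fraction < 1.0:
        raise ValueError("seen_fraction must be strictly between 0 and 1")
    names = sorted(class_names)  # fixed starting order so the seed is reproducible
    rng = random.Random(seed)
    rng.shuffle(names)
    n_seen = round(seen_fraction * len(names))
    return sorted(names[:n_seen]), sorted(names[n_seen:])


# Placeholder labels standing in for the 200 ActivityNet v1.3 action classes.
classes = [f"class_{i:03d}" for i in range(200)]
seen, unseen = split_classes(classes, 0.75)
print(len(seen), len(unseen))  # 150 50
```

With 200 classes, the 75% split trains on 150 seen classes and tests on the 50 unseen ones, while the 50% split uses 100 of each.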
To train STALE from scratch, run the following command. The training configuration can be adjusted in the config/anet.yaml file.
python stale_train.py
We provide pretrained checkpoints for both the 50% and 75% labeled-data splits in the zero-shot setting.
Dataset | Split (Seen-Unseen) | Feature | Link |
---|---|---|---|
ActivityNet | 50%-50% | CLIP | ckpt |
ActivityNet | 75%-25% | CLIP | ckpt |
After downloading the checkpoints, set the checkpoint path in the config/anet.yaml file.
Model inference can then be performed using the following command:
python stale_inference.py
To evaluate our STALE model, run the following command:
python eval.py
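Temporal action detection is conventionally scored by mean average precision at one or more temporal IoU (tIoU) thresholds. The internals of eval.py are not described here, so the following is only a minimal sketch of the tIoU measure itself, under the assumption that segments are (start, end) pairs in seconds; `temporal_iou` and `is_true_positive` are hypothetical names.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # overlap length, 0 if disjoint
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth segment when tIoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```

In the example, the two 6-second segments overlap for 4 seconds and span 8 seconds together, giving a tIoU of 0.5.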
Our source code is based on the implementations of DenseCLIP, MaskFormer and CoOp. We thank the authors for open-sourcing their code.
If you find this project useful for your research, please use the following BibTeX entry.
@article{nag2022zero,
title={Zero-shot temporal action detection via vision-language prompting},
author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-Zhe and Xiang, Tao},
journal={arXiv e-prints},
pages={arXiv--2207},
year={2022}
}