Zero-Shot Temporal Action Detection via Vision-Language Prompting

Sauradip Nag^1,2,+ Xiatian Zhu^1,3 Yi-Zhe Song^1,2 Tao Xiang^1,2

¹CVSSP, University of Surrey, UK ²iFlyTek-Surrey Joint Research Center on Artificial Intelligence, UK
³Surrey Institute for People-Centred Artificial Intelligence, UK

⁺corresponding author

Accepted to ECCV 2022

Paper | Project Page

Updates

(July 2022) We released the STALE training and inference code for the ActivityNet v1.3 dataset.
(June 2022) STALE was accepted to ECCV 2022.

Summary

First prompt-guided framework for the Zero-Shot Temporal Action Detection (ZS-TAD) task.
Adapts classification-based CLIP to detection-based TAD using Representation Masking.
Transformer-based Cross-Adaptation module that contextualizes the classifier using Vision-Language features.
Inter-Branch consistency learning that encourages the model to find accurate action boundaries.

Abstract

Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging with significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors.

Architecture

Getting Started

Requirements

Python 3.7
PyTorch == 1.9.0 (please make sure your PyTorch version is at least 1.8)
NVIDIA GPU
Hugging-Face Transformers
Detectron

Environment Setup

We suggest creating a Conda environment and installing the requirements with the following command:

pip3 install -r requirements.txt

Extra Dependencies

We use the MaskFormer implementation for Representation Masking.

git clone https://github.com/sauradip/STALE.git
cd STALE
git clone https://github.com/facebookresearch/MaskFormer

Follow the MaskFormer installation instructions to install Detectron and the other modules, within this same environment if possible. After this step, copy the files from /STALE/extra_files into /STALE/MaskFormer/mask_former/modeling/transformer/.
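The file placement described above can also be scripted. The following is a minimal sketch, assuming both folders sit under the same STALE checkout; the helper name copy_extra_files is hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def copy_extra_files(stale_root):
    """Copy every file in <stale_root>/extra_files into the MaskFormer transformer folder.

    Hypothetical helper: the folder layout follows the README text above.
    Returns the names of the copied files.
    """
    root = Path(stale_root)
    src_dir = root / "extra_files"
    dst_dir = root / "MaskFormer" / "mask_former" / "modeling" / "transformer"
    if not src_dir.is_dir():
        raise FileNotFoundError(f"missing source folder: {src_dir}")
    if not dst_dir.is_dir():
        raise FileNotFoundError(f"missing MaskFormer folder: {dst_dir} (clone MaskFormer first)")
    copied = []
    for src in sorted(src_dir.iterdir()):
        if src.is_file():
            # copy2 overwrites an existing file of the same name and keeps timestamps
            shutil.copy2(src, dst_dir / src.name)
            copied.append(src.name)
    return copied
```

From the root of the STALE checkout, calling copy_extra_files(".") would perform the copy and return the list of copied file names. Note that files with the same name in the target folder are overwritten.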

Download Features

Download the video features and update the video paths and output paths in the config/anet.yaml file. For now, only the ActivityNet v1.3 dataset config is available. We are planning to release the code for the THUMOS14 dataset soon.
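After editing the config, it can help to check that the paths in it actually exist before starting a long run. The sketch below is an illustration only: it works on an already-loaded config dictionary (for example the result of yaml.safe_load on config/anet.yaml, which needs PyYAML), and it assumes that path entries have "path" or "dir" in their key names. The real key names in config/anet.yaml may differ, and the function name is hypothetical.

```python
from pathlib import Path


def find_missing_paths(config, prefix=""):
    """Return (dotted_key, value) pairs for path-like entries that do not exist on disk.

    Assumption: an entry is path-like when its value is a string and its key
    contains "path" or "dir". Nested dictionaries are searched recursively.
    """
    missing = []
    for key, value in config.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            missing.extend(find_missing_paths(value, prefix=f"{dotted}."))
        elif isinstance(value, str) and ("path" in key.lower() or "dir" in key.lower()):
            if not Path(value).exists():
                missing.append((dotted, value))
    return missing
```

An output path may be reported as missing simply because it has not been created yet, so the result is a list to review rather than a list of errors.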

Dataset	Feature Backbone	Pre-Training	Link
ActivityNet	ViT-B/16-CLIP	CLIP	Google Drive
THUMOS	ViT-B/16-CLIP	CLIP	Google Drive
ActivityNet	I3D	Kinetics-400	Google Drive
THUMOS	I3D	Kinetics-400	Google Drive

Training Splits

Currently we support the training splits provided by the EfficientPrompt paper. Both the 50% and 75% labeled-data splits are available for training. They can be found in STALE/splits.
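To make the idea of a seen/unseen split concrete, the sketch below partitions a list of class names into a seen set (used for training) and an unseen set (held out for zero-shot testing). It is an illustration under stated assumptions and does not reproduce the split files in STALE/splits: the function name, the seed, and the shuffling scheme are all hypothetical.

```python
import random


def split_classes(class_names, seen_fraction, seed=0):
    """Split class names into disjoint (seen, unseen) lists.

    Illustrative only: the repository ships predefined splits, and this
    random partition will not match them.
    """
    names = sorted(class_names)          # sort first so the result depends only on the seed
    rng = random.Random(seed)
    rng.shuffle(names)
    n_seen = round(len(names) * seen_fraction)
    return names[:n_seen], names[n_seen:]
```

With the 200 action classes of ActivityNet v1.3, a seen fraction of 0.75 gives 150 seen and 50 unseen classes, and 0.5 gives 100 of each.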

Model Training

To train STALE from scratch, run the following command. The training configuration can be adjusted in the config/anet.yaml file.

python stale_train.py

Model Inference

We provide the pretrained models containing the checkpoints for both the 50% and 75% labeled-data splits for the zero-shot setting.

Dataset	Split (Seen-Unseen)	Feature	Link
ActivityNet	50%-50%	CLIP	ckpt
ActivityNet	75%-25%	CLIP	ckpt

After downloading the checkpoints, set the checkpoint path in the config/anet.yaml file. Model inference can then be performed using the following command:

python stale_inference.py

Model Evaluation

To evaluate our STALE model, run the following command.

python eval.py
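The internals of eval.py are not described here. As background, temporal action detection benchmarks such as ActivityNet conventionally match a predicted segment to a ground-truth segment by their temporal intersection-over-union (tIoU). The sketch below illustrates that overlap measure only; the function name and the (start, end) tuple convention are assumptions, not the repository's API.

```python
def temporal_iou(segment_a, segment_b):
    """Temporal IoU of two (start, end) segments given in the same time unit."""
    start_a, end_a = segment_a
    start_b, end_b = segment_b
    # Overlap length is zero when the segments do not intersect
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, a prediction covering seconds 0 to 10 and a ground-truth segment covering seconds 5 to 15 overlap for 5 seconds out of a 15-second union, giving a tIoU of 1/3.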

TO-DO Checklist

[ ] Fix the learnable-prompt issue in Hugging-Face Transformers
[x] Fix the NaN bug during Model-Training
[ ] Support for THUMOS14 dataset
[x] Enable multi-gpu training

Acknowledgement

Our source code is based on the implementations of DenseCLIP, MaskFormer and CoOp. We thank the authors for open-sourcing their code.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{nag2022zero,
  title={Zero-shot temporal action detection via vision-language prompting},
  author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-Zhe and Xiang, Tao},
  journal={arXiv e-prints},
  pages={arXiv--2207},
  year={2022}
}
