sauradip / STALE

[ECCV 2022] Official PyTorch implementation of the paper "Zero-Shot Temporal Action Detection via Vision-Language Prompting"
https://sauradip.github.io/project_pages/STALE/


Zero-Shot Temporal Action Detection via Vision-Language Prompting

Sauradip Nag (1,2,+), Xiatian Zhu (1,3), Yi-Zhe Song (1,2), Tao Xiang (1,2)
(1) CVSSP, University of Surrey, UK
(2) iFlyTek-Surrey Joint Research Center on Artificial Intelligence, UK
(3) Surrey Institute for People-Centred Artificial Intelligence, UK
(+) corresponding author

Accepted to ECCV 2022

Paper | Project Page

Updates

Summary

Abstract

Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging with significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors.

Architecture

Getting Started

Requirements

Environment Setup

We suggest creating a Conda environment and installing the requirements into it:

pip3 install -r requirements.txt

Extra Dependencies

We use the MaskFormer implementation for representation masking.

git clone https://github.com/sauradip/STALE.git
cd STALE
git clone https://github.com/facebookresearch/MaskFormer

Follow the MaskFormer installation instructions to install Detectron2 and the other modules, within this same environment if possible. After this step, place the files from /STALE/extra_files into /STALE/MaskFormer/mask_former/modeling/transformer/.
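
The file placement step above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `copy_extra_files` is hypothetical, and it assumes that both `extra_files` and the cloned `MaskFormer` directory sit directly under the STALE root, as the clone commands above imply.

```python
import shutil
from pathlib import Path


def copy_extra_files(stale_root):
    """Copy each file in <stale_root>/extra_files into the MaskFormer transformer package.

    Hypothetical helper: the directory layout is taken from the README text.
    """
    root = Path(stale_root)
    src = root / "extra_files"
    dst = root / "MaskFormer" / "mask_former" / "modeling" / "transformer"
    if not src.is_dir():
        raise FileNotFoundError(f"missing source directory: {src}")
    if not dst.is_dir():
        raise FileNotFoundError(f"missing target directory (is MaskFormer cloned?): {dst}")
    copied = []
    for item in sorted(src.iterdir()):
        if item.is_file():
            # copy2 overwrites a file of the same name in the target directory
            shutil.copy2(item, dst / item.name)
            copied.append(item.name)
    return copied
```

Calling `copy_extra_files(".")` from the STALE root returns the names of the copied files. Files with the same name in the MaskFormer directory are overwritten, so it may be worth checking the returned list before training.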

Download Features

Download the video features and update the video and output paths in the config/anet.yaml file. For now, only the ActivityNet v1.3 dataset config is available; we plan to release the code for the THUMOS14 dataset soon.

| Dataset | Feature Backbone | Pre-Training | Link |
| --- | --- | --- | --- |
| ActivityNet | ViT-B/16-CLIP | CLIP | Google Drive |
| THUMOS | ViT-B/16-CLIP | CLIP | Google Drive |
| ActivityNet | I3D | Kinetics-400 | Google Drive |
| THUMOS | I3D | Kinetics-400 | Google Drive |

Training Splits

Currently we support the training splits provided by the EfficientPrompt paper. Both the 50% and 75% labeled-data splits are available for training; they can be found in STALE/splits.
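
To illustrate what a seen-unseen split means in the zero-shot setting, here is a minimal sketch that partitions a class list into disjoint seen (training) and unseen (test) sets. It does not reproduce the EfficientPrompt splits shipped in STALE/splits, whose file format is not described here; the function name, the random seed, and the placeholder class names are all assumptions made for illustration.

```python
import random


def split_classes(class_names, seen_fraction, seed=0):
    """Partition class names into disjoint seen and unseen lists.

    Hypothetical helper for illustration only; the real splits are the
    fixed ones provided in STALE/splits.
    """
    if not 0.0 < seen_fraction < 1.0:
        raise ValueError("seen_fraction must be strictly between 0 and 1")
    names = sorted(set(class_names))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(names)
    n_seen = round(len(names) * seen_fraction)
    return names[:n_seen], names[n_seen:]


# ActivityNet v1.3 has 200 action classes; placeholder names stand in for them.
classes = [f"class_{i:03d}" for i in range(200)]
seen, unseen = split_classes(classes, seen_fraction=0.75)
print(len(seen), len(unseen))  # 150 50
```

With `seen_fraction=0.5` the same call yields 100 seen and 100 unseen classes, matching the 50%-50% setting.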

Model Training

To train STALE from scratch, run the following command. The training configuration can be adjusted in the config/anet.yaml file.

python stale_train.py

Model Inference

We provide pretrained checkpoints for the zero-shot setting, trained on both the 50% and 75% labeled-data splits.

| Dataset | Split (Seen-Unseen) | Feature | Link |
| --- | --- | --- | --- |
| ActivityNet | 50%-50% | CLIP | ckpt |
| ActivityNet | 75%-25% | CLIP | ckpt |

After downloading the checkpoints, set the checkpoint path in the config/anet.yaml file. Inference can then be run with the following command:

python stale_inference.py

Model Evaluation

To evaluate the STALE model, run the following command:

python eval.py
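
The training, inference, and evaluation commands can be chained from one small driver. The sketch below is an assumption-laden convenience, not part of the repository: `build_commands` and `run_pipeline` are hypothetical names, and it assumes the three scripts take no command-line arguments and read their settings from config/anet.yaml, as described above.

```python
import subprocess
import sys

# Script names are taken from the commands in this README.
# The driver is expected to be run from the STALE repository root.
STAGES = {
    "train": "stale_train.py",
    "inference": "stale_inference.py",
    "eval": "eval.py",
}


def build_commands(stages=("train", "inference", "eval")):
    """Return one argv list per requested stage, in the order given."""
    return [[sys.executable, STAGES[name]] for name in stages]


def run_pipeline(stages=("train", "inference", "eval")):
    """Run each stage in turn; check=True stops at the first failing script."""
    for cmd in build_commands(stages):
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline(("inference", "eval"))` skips training, which fits the case where a downloaded checkpoint is already set in the config file.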

TO-DO Checklist

Acknowledgement

Our source code is based on the implementations of DenseCLIP, MaskFormer, and CoOp. We thank the authors for open-sourcing their code.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{nag2022zero,
  title={Zero-shot temporal action detection via vision-language prompting},
  author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-Zhe and Xiang, Tao},
  journal={arXiv e-prints},
  pages={arXiv--2207},
  year={2022}
}