This code repo implements ActionFormer, one of the first Transformer-based models for temporal action localization: detecting the onsets and offsets of action instances and recognizing their action categories. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points and crossing 60% mAP for the first time. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.56% average mAP) and the more challenging EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our paper has been accepted to ECCV 2022, and an arXiv version can be found at this link.
In addition, ActionFormer is the backbone for many winning solutions in the Ego4D Moment Queries Challenge 2022. Our submission in particular is ranked 2nd with a record 21.76% average mAP and 42.54% Recall@1x at tIoU=0.5, nearly three times higher than the official baseline. An arXiv version of our tech report can be found at this link. We invite our audience to try out the code.
Specifically, we adopt a minimalist design and develop a Transformer-based model for temporal action localization, inspired by the recent success of Transformers in NLP and vision. Our method, illustrated in the figure, adapts local self-attention to model temporal context in untrimmed videos, classifies every moment in an input video, and regresses the corresponding action boundaries. The result is a deep model that is trained using standard classification and regression losses and can localize moments of actions in a single shot, without using action proposals or pre-defined anchor windows.
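To make the "classify every moment, regress its boundaries" idea concrete, here is a minimal sketch of how per-moment outputs could be turned into action candidates. It is an illustration and not the code in this repo: the function name `decode_segments`, the single-class scores, the fixed score threshold, and the assumption that the regression output is a pair of distances (in time steps) from the moment to the onset and offset are all simplifications introduced here.

```python
from typing import List, Tuple


def decode_segments(
    cls_scores: List[float],
    offsets: List[Tuple[float, float]],
    sec_per_step: float,
    score_threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Turn per-moment outputs into (start_sec, end_sec, score) candidates.

    cls_scores[t]: confidence that moment t lies inside an action (one class).
    offsets[t]:    assumed (distance to onset, distance to offset) in time steps.
    """
    segments = []
    for t, (score, (d_start, d_end)) in enumerate(zip(cls_scores, offsets)):
        if score < score_threshold:
            continue  # this moment is treated as background
        start = max(0.0, (t - d_start) * sec_per_step)
        end = (t + d_end) * sec_per_step
        segments.append((start, end, score))
    return segments


# Toy input: three moments, only the middle one is confident.
candidates = decode_segments(
    cls_scores=[0.1, 0.9, 0.2],
    offsets=[(0.0, 0.0), (1.0, 1.5), (0.0, 0.0)],
    sec_per_step=0.5,
)
print(candidates)  # [(0.0, 1.25, 0.9)]
```

Because every moment emits its own candidate, no proposals or anchor windows are needed; a full pipeline would still have to merge the overlapping candidates produced by neighboring moments.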
Related projects:
SnAG: Scalable and Accurate Video Grounding
Fangzhou Mu*, Sicheng Mo*, Yin Li
CVPR 2024
11/18/2022: We have released the tech report for our submission to the Ego4D Moment Queries (MQ) Challenge. The code repo now includes config files, pre-trained models and results on the Ego4D MQ benchmark.
08/29/2022: Updated arXiv version.
08/01/2022: Updated code repo with latest results on ActivityNet.
07/08/2022: The paper is accepted to ECCV 2022.
05/09/2022: Pre-trained models have been updated.
05/08/2022: We have updated the code repo based on the community feedback and our code review, leading to significantly better average mAP on THUMOS14 (>66.0%) and slightly improved results on ActivityNet and EPIC-Kitchens 100.
The structure of this code repo is heavily inspired by Detectron2. Some of the main components are
Download Features and Annotations
Download the features and annotations (md5sum 375f76ffbf7447af1035e694971ec9b2) from this Box link, this Google Drive link, or this BaiduYun link.
Details: The features are extracted from two-stream I3D models pretrained on Kinetics using clips of 16 frames at the video frame rate (~30 fps) and a stride of 4 frames. This gives one feature vector per 4/30 ~= 0.1333 seconds.
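The time step between features follows directly from the stride and the frame rate. The sketch below reproduces that arithmetic for the THUMOS14 setting; the helper names are hypothetical, and placing a feature at the center of its clip is an assumed convention for illustration, not something this README specifies.

```python
def seconds_per_feature(stride_frames: int, fps: float) -> float:
    """Temporal spacing between consecutive feature vectors, in seconds."""
    return stride_frames / fps


def feature_center_time(
    index: int, clip_frames: int, stride_frames: int, fps: float
) -> float:
    """Center time of the clip behind feature `index` (assumed convention)."""
    first_frame = index * stride_frames
    return (first_frame + clip_frames / 2) / fps


# THUMOS14 I3D features: 16-frame clips, stride of 4 frames, ~30 fps.
step = seconds_per_feature(stride_frames=4, fps=30)
print(f"{step:.4f}")  # 0.1333

# Center of the clip behind the first feature under the assumed convention.
print(f"{feature_center_time(0, clip_frames=16, stride_frames=4, fps=30):.4f}")  # 0.2667
```

The same two functions apply to the other feature sets in this repo once their clip length, stride, and frame rate are substituted.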
Unpack Features and Annotations
This folder
│ README.md
│ ...
│
└───data/
│ └───thumos/
│ │ └───annotations
│ │ └───i3d_features
│ └───...
|
└───libs
│
│ ...
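Before training, it can help to confirm that the unpacked files match the layout above. The following is a small sketch of such a check; `missing_paths` and `THUMOS_LAYOUT` are hypothetical names introduced here and are not part of the repo.

```python
from pathlib import Path
from typing import Iterable, List

# Folders from the tree above, relative to the repo root.
THUMOS_LAYOUT = [
    "data/thumos/annotations",
    "data/thumos/i3d_features",
    "libs",
]


def missing_paths(root: str, expected: Iterable[str]) -> List[str]:
    """Return the expected sub-folders that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_dir()]


# Run from the repo root; an empty list means the layout looks right.
print(missing_paths(".", THUMOS_LAYOUT))
```

`Path.is_dir()` follows symbolic links, so the check also passes when the data lives elsewhere and is linked into ./data.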
Training and Evaluation
python ./train.py ./configs/thumos_i3d.yaml --output reproduce
tensorboard --logdir=./ckpt/thumos_i3d_reproduce/logs
python ./eval.py ./configs/thumos_i3d.yaml ./ckpt/thumos_i3d_reproduce
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for THUMOS14. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───thumos_i3d_reproduce/
│ │ └───thumos_reproduce_log.txt
│ │ └───thumos_reproduce_results.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
python ./eval.py ./configs/thumos_i3d.yaml ./pretrained/thumos_i3d_reproduce/
Method | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | Avg |
---|---|---|---|---|---|---|
ActionFormer | 82.13 | 77.80 | 70.95 | 59.40 | 43.87 | 66.83 |
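The Avg column on THUMOS14 is the plain mean of the mAP values at the five listed tIoU thresholds. A short check using the numbers from the table above (the function name `average_map` is introduced here for illustration):

```python
from typing import Dict


def average_map(map_by_tiou: Dict[float, float]) -> float:
    """Mean of the per-threshold mAP values (in percent)."""
    return sum(map_by_tiou.values()) / len(map_by_tiou)


# mAP (%) at each tIoU threshold, copied from the table above.
thumos_map = {0.3: 82.13, 0.4: 77.80, 0.5: 70.95, 0.6: 59.40, 0.7: 43.87}

print(f"{average_map(thumos_map):.2f}")  # 66.83
```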
Download Features and Annotations
Download the features and annotations (md5sum c415f50120b9425ee1ede9ac3ce11203) from this Box link, this Google Drive link, or this BaiduYun link.
Details: The features are extracted from the R(2+1)D-34 model pretrained with TSP on ActivityNet using clips of 16 frames at a frame rate of 15 fps and a stride of 16 frames (i.e., non-overlapping clips). This gives one feature vector per 16/15 ~= 1.067 seconds. The features are converted into numpy files for our code.
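The md5sum listed with each download can be checked before unpacking. Below is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the archive path in the usage comment is a placeholder, since the actual file name depends on what you downloaded.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: str, expected: str) -> bool:
    """True if the file's MD5 equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.strip().lower()


# Usage (placeholder path, replace with your downloaded archive):
# ok = matches_md5("/path/to/downloaded_archive", "c415f50120b9425ee1ede9ac3ce11203")
```

Reading in chunks keeps memory use flat, which matters for feature archives that are several gigabytes in size.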
Unpack Features and Annotations
This folder
│ README.md
│ ...
│
└───data/
│ └───anet_1.3/
│ │ └───annotations
│ │ └───tsp_features
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
python ./train.py ./configs/anet_tsp.yaml --output reproduce
tensorboard --logdir=./ckpt/anet_tsp_reproduce/logs
python ./eval.py ./configs/anet_tsp.yaml ./ckpt/anet_tsp_reproduce
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for ActivityNet 1.3. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───anet_tsp_reproduce/
│ │ └───anet_tsp_reproduce_log.txt
│ │ └───anet_tsp_reproduce_results.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
python ./eval.py ./configs/anet_tsp.yaml ./pretrained/anet_tsp_reproduce/
Method | 0.5 | 0.75 | 0.95 | Avg |
---|---|---|---|---|
ActionFormer | 54.67 | 37.81 | 8.36 | 36.56 |
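Each column above counts a prediction as correct only if its temporal IoU (tIoU) with a ground-truth instance reaches the column's threshold. Note that on ActivityNet 1.3 the average mAP is conventionally taken over ten thresholds from 0.5 to 0.95 in steps of 0.05, which is why Avg is not the mean of the three columns shown. The sketch below illustrates the tIoU computation; it is a stand-alone example, not the evaluation code used by ./eval.py.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(seg_a: Segment, seg_b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    a_start, a_end = seg_a
    b_start, b_end = seg_b
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds conventionally averaged on ActivityNet 1.3.
anet_thresholds: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]

prediction, ground_truth = (0.0, 5.0), (0.0, 10.0)
tiou = temporal_iou(prediction, ground_truth)
print(tiou)  # 0.5
print([t for t in anet_thresholds if tiou >= t])  # [0.5]
```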
[Optional] Reproducing Our Results with I3D Features
Download the I3D features (md5sum e649425954e0123401650312dd0d56a7) from this Google Drive link.
Details: The features are extracted from the I3D model pretrained on Kinetics using clips of 16 frames at a frame rate of 25 fps and a stride of 16 frames. This gives one feature vector per 16/25 = 0.64 seconds. The features are converted into numpy files for our code.
Unpack the file under ./data (or elsewhere and link to ./data), similar to TSP features.
Train our ActionFormer with I3D features. This will create an experiment folder under ./ckpt that stores training config, logs, and checkpoints.
python ./train.py ./configs/anet_i3d.yaml --output reproduce
Evaluate the trained model. The expected average mAP is around 36.0%. This is a slight improvement over the result in our paper, due to a better training scheme and hyperparameters (see comments in the config file).
python ./eval.py ./configs/anet_i3d.yaml ./ckpt/anet_i3d_reproduce
The pre-trained model with all training logs can be downloaded from this Google Drive link. To produce the results, create a folder ./pretrained, unpack the file under ./pretrained (or elsewhere and link to ./pretrained), and run
python ./eval.py ./configs/anet_i3d.yaml ./pretrained/anet_i3d_reproduce/
The results (mAP at tIoUs) with I3D features should be
Method | 0.5 | 0.75 | 0.95 | Avg |
---|---|---|---|---|
ActionFormer | 54.29 | 36.71 | 8.24 | 36.03 |
Download Features and Annotations
Download the features and annotations (md5sum add9803756afd9a023bc9a9c547e0229) from this Box link, this Google Drive link, or this BaiduYun link.
Details: The features are extracted from the SlowFast model pretrained on the training set of EPIC-Kitchens 100 (action classification) using clips of 32 frames at a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.5333 seconds.
Unpack Features and Annotations
This folder
│ README.md
│ ...
│
└───data/
│ └───epic_kitchens/
│ │ └───annotations
│ │ └───features
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
python ./train.py ./configs/epic_slowfast_verb.yaml --output reproduce
python ./train.py ./configs/epic_slowfast_noun.yaml --output reproduce
python ./eval.py ./configs/epic_slowfast_verb.yaml ./ckpt/epic_slowfast_verb_reproduce
python ./eval.py ./configs/epic_slowfast_noun.yaml ./ckpt/epic_slowfast_noun_reproduce
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for EPIC-Kitchens 100. The model with all training logs can be downloaded from this Google Drive link (verb), and from this Google Drive link (noun). To evaluate the pre-trained model, please follow the steps listed below.
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───epic_slowfast_verb_reproduce/
│ │ └───epic_slowfast_verb_reproduce_log.txt
│ │ └───epic_slowfast_verb_reproduce_results.txt
│ │ └───...
│ └───epic_slowfast_noun_reproduce/
│ │ └───epic_slowfast_noun_reproduce_log.txt
│ │ └───epic_slowfast_noun_reproduce_results.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
python ./eval.py ./configs/epic_slowfast_verb.yaml ./pretrained/epic_slowfast_verb_reproduce/
python ./eval.py ./configs/epic_slowfast_noun.yaml ./pretrained/epic_slowfast_noun_reproduce/
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
---|---|---|---|---|---|---|
ActionFormer (verb) | 26.58 | 25.42 | 24.15 | 22.29 | 19.09 | 23.51 |
ActionFormer (noun) | 25.21 | 24.11 | 22.66 | 20.47 | 16.97 | 21.88 |
Download Features and Annotations
./tools/convert_ego4d_trainval.py.
Details: All features are extracted at 1.875 fps from videos at 30 fps. This gives one feature vector per ~0.5333 seconds. Please refer to Ego4D and EgoVLP's documentation for more details on feature extraction.
Unpack Features and Annotations
This folder
│ README.md
│ ...
│
└───data/
│ └───ego4d/
│ │ └───annotations
│ │ └───slowfast_features
│ │ └───omnivore_features
│ │ └───egovlp_features
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
python ./train.py ./configs/ego4d_omnivore_egovlp.yaml --output reproduce
tensorboard --logdir=./ckpt/ego4d_omnivore_egovlp_reproduce/logs
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./ckpt/ego4d_omnivore_egovlp_reproduce
[Optional] Evaluating Our Pre-trained Model
We also provide pre-trained models for Ego4D trained with all feature combinations. The models with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───ego4d_omnivore_egovlp_reproduce/
│ │ └───ego4d_omnivore_egovlp_reproduce_log.txt
│ │ └───ego4d_omnivore_egovlp_reproduce_results.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./pretrained/ego4d_omnivore_egovlp_reproduce/
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
---|---|---|---|---|---|---|
ActionFormer (S) | 20.09 | 17.45 | 14.44 | 12.46 | 10.00 | 14.89 |
ActionFormer (O) | 23.87 | 20.78 | 18.39 | 15.33 | 12.65 | 18.20 |
ActionFormer (E) | 26.84 | 23.86 | 20.57 | 17.19 | 14.54 | 20.60 |
ActionFormer (S+E) | 27.98 | 24.46 | 21.21 | 18.56 | 15.60 | 21.56 |
ActionFormer (O+E) | 27.99 | 24.94 | 21.94 | 19.05 | 15.98 | 21.98 |
ActionFormer (S+O+E) | 28.26 | 24.69 | 21.88 | 19.35 | 16.28 | 22.09 |
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
---|---|---|---|---|---|---|
ActionFormer (S) | 52.25 | 45.84 | 40.60 | 36.58 | 31.33 | 41.32 |
ActionFormer (O) | 54.63 | 48.72 | 43.03 | 37.76 | 33.57 | 43.54 |
ActionFormer (E) | 59.53 | 54.39 | 48.97 | 42.75 | 37.12 | 48.55 |
ActionFormer (S+E) | 59.96 | 53.75 | 48.76 | 44.00 | 38.96 | 49.09 |
ActionFormer (O+E) | 61.03 | 54.15 | 49.79 | 45.17 | 39.88 | 49.99 |
ActionFormer (S+O+E) | 60.85 | 54.16 | 49.60 | 45.12 | 39.87 | 49.92 |
Work in progress. Stay tuned.
Yin Li (yin.li@wisc.edu)
If you are using our code, please consider citing our paper.
@inproceedings{zhang2022actionformer,
title={ActionFormer: Localizing Moments of Actions with Transformers},
author={Zhang, Chen-Lin and Wu, Jianxin and Li, Yin},
booktitle={European Conference on Computer Vision},
series={LNCS},
volume={13664},
pages={492--510},
year={2022}
}
If you cite our results on Ego4D, please consider citing our tech report in addition to the main paper.
@article{mu2022actionformerego4d,
title={Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge},
author={Mu, Fangzhou and Mo, Sicheng and Wang, Gillian and Li, Yin},
journal={arXiv e-prints},
year={2022}
}
If you are using TSP features, please cite
@inproceedings{alwassel2021tsp,
title={{TSP}: Temporally-sensitive pretraining of video encoders for localization tasks},
author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
pages={3173--3183},
year={2021}
}