gsig / charades-algorithms

Activity Recognition Algorithms for the Charades Dataset

Charades Starter Code for Activity Recognition in Torch and PyTorch

Contributor: Gunnar Atli Sigurdsson

New: an extension of this framework to the deep CRF model on Charades, from Asynchronous Temporal Fields for Action Recognition:

See pytorch/ and torch/ for the PyTorch and Torch code, respectively.

The code replicates the 'Two-Stream Extended' and 'Two-Stream+LSTM' baselines found in:

Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, and Abhinav Gupta. Asynchronous Temporal Fields for Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

which is in turn based on "Two-stream convolutional networks for action recognition in videos" by Simonyan and Zisserman, and "Beyond Short Snippets: Deep Networks for Video Classification" by Joe Yue-Hei Ng et al.

Combining the predictions (submission files) of the RGB and Flow networks yields a final classification performance of 18.9% mAP (Two-Stream) and 19.8% mAP (LSTM) on Charades (evaluated with charades_v1_classify.m).
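The README does not spell out how the RGB and Flow submission files are merged, or exactly how charades_v1_classify.m computes mAP. As a rough illustration only, the sketch below assumes late fusion by a weighted average of per-class scores (the weight `w_rgb` is hypothetical) and computes a plain non-interpolated mAP: for each class, rank all videos by score, average the precision at the rank of each positive video, then average over the classes that have at least one positive. The official MATLAB script may differ in details such as tie handling. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Optional, Set


def fuse_scores(rgb: Dict[str, List[float]], flow: Dict[str, List[float]],
                w_rgb: float = 0.5) -> Dict[str, List[float]]:
    """Late fusion: weighted average of per-class scores for each video.

    The averaging rule and w_rgb are assumptions, not the repository's method.
    """
    return {vid: [w_rgb * r + (1.0 - w_rgb) * f for r, f in zip(scores, flow[vid])]
            for vid, scores in rgb.items()}


def average_precision(scores: List[float], labels: List[bool]) -> Optional[float]:
    """Non-interpolated AP: mean precision at the rank of each positive.

    Returns None when the class has no positive videos.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def mean_ap(submission: Dict[str, List[float]], ground_truth: Dict[str, Set[int]],
            num_classes: int) -> float:
    """Average AP over classes, skipping classes with no positive videos."""
    vids = sorted(ground_truth)
    aps = []
    for c in range(num_classes):
        ap = average_precision([submission[v][c] for v in vids],
                               [c in ground_truth[v] for v in vids])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Toy data: three videos, two classes (made up for illustration).
    rgb = {"v1": [0.9, 0.6], "v2": [0.2, 0.7], "v3": [0.6, 0.3]}
    flow = {"v1": [0.7, 0.1], "v2": [0.4, 0.5], "v3": [0.2, 0.8]}
    gt = {"v1": {0}, "v2": {1}, "v3": {1}}
    print("RGB only:", mean_ap(rgb, gt, 2))
    print("Fused:   ", mean_ap(fuse_scores(rgb, flow), gt, 2))
```

On this toy data the RGB scores alone give an mAP of about 0.917, because the second class ranks a negative video above one of its positives; the fused scores rank every positive first and reach 1.0. Equal-weight averaging is only one fusion choice; a weight tuned on validation data is another.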

Technical Overview:

The code trains a two-stream network as two independent networks: one RGB network and one Flow network. The training data is parsed into pairs of an image (or flow) and a label for a single activity class, which gives a softmax training setup like that of a standard image CNN. Both networks are VGG-16: the RGB network is pretrained on ImageNet, and the Flow network is pretrained on UCF101. The pretrained networks can be downloaded with the scripts in this directory.

For testing, the network uses a batch size of 25: it scores all 25 images and either pools the outputs into a single classification prediction or uses all 25 outputs for localization.