This repo contains PyTorch reimplementations of few-shot action recognition methods built on a shared codebase, as most of these methods have no public code. These are not the official versions from the original papers/authors.
I intend to keep it up to date so there's a common resource for people interested in this topic, and it should be a good codebase to start from if you want to implement your own method.
Feature/method/pull requests are welcome, along with any suggestions, help or questions.
I've chosen not to support Kinetics because the full dataset no longer exists (videos are continually removed from YouTube or marked as private), so results aren't repeatable, and downloading the videos that remain is a pain because YouTube can randomly block you for scraping. Additionally, it's not a very good test of few-shot action recognition methods: its classes can often be distinguished by appearance alone, so it doesn't test temporal understanding.
Conda is recommended.
To use a ResNet-50 backbone you'll need a machine with at least 4 x 11GB GPUs. Everything fits on a single GPU if you use a ResNet-18 backbone.
Download the datasets from their original locations:
Once you've downloaded the datasets, you can use the extract scripts to extract frames and put them in train/val/test folders. You'll need to modify the paths at the top of the scripts. To remove unnecessary frames and save space (e.g. keeping just 8 uniformly sampled frames per video), you can use shrink_dataset.py; again, modify the paths at the top of the script.
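For reference, the uniform sampling that shrink_dataset.py performs looks roughly like this (a minimal sketch; the directory layout, frame naming, and path are assumptions for illustration, not the script's actual code):

```python
import os

def uniform_sample(frames, n=8):
    """Pick n frame filenames spread evenly across the video."""
    step = len(frames) / n
    return [frames[min(int(i * step), len(frames) - 1)] for i in range(n)]

# Hypothetical layout: one .jpg per extracted frame in each video folder.
video_dir = "path/to/video_frames"  # placeholder path
frames = sorted(os.listdir(video_dir))
keep = set(uniform_sample(frames, n=8))
for f in frames:
    if f not in keep:
        os.remove(os.path.join(video_dir, f))  # delete unsampled frames
```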
Use run.py. Example arguments for some training runs are in the scripts folder. Depending on your GPU configuration, you might need to modify the distribute functions in model.py to suit your system.
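If you do need to change them, a distribute function usually just pins parts of the model to specific devices. A minimal sketch of the idea, assuming the model exposes a backbone submodule (a hypothetical attribute name, not the actual model.py code):

```python
import torch

def distribute_model(model, num_gpus):
    """Spread the (hypothetical) backbone across the available GPUs."""
    if num_gpus > 1:
        # replicate the heaviest part of the model across all GPUs
        model.backbone = torch.nn.DataParallel(
            model.backbone, device_ids=list(range(num_gpus)))
    return model.cuda(0)  # keep the rest of the model on GPU 0
```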
Inherit from the class CNN_FSHead in model.py, and add the option to use it in run.py. That's it! You can see how the other methods do this in model.py.
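As a sketch of what a new method can look like, here is a hypothetical prototypical-network-style head. CNN_FSHead and model.py come from this repo, but the get_feats call, the forward signature, and the returned dict are assumptions; check the existing methods in model.py for the real interface:

```python
import torch
from model import CNN_FSHead  # provides the shared backbone

class ProtoExample(CNN_FSHead):
    """Hypothetical example method: nearest-prototype matching on
    frame-averaged backbone features. Interface details are assumed."""
    def forward(self, support_images, support_labels, target_images):
        # get_feats is assumed to return per-frame backbone features
        support_feats, target_feats = self.get_feats(support_images, target_images)
        support_feats = support_feats.mean(dim=1)  # average over frames
        target_feats = target_feats.mean(dim=1)
        # one prototype per class: mean of that class's support features
        classes = torch.unique(support_labels)
        prototypes = torch.stack(
            [support_feats[support_labels == c].mean(dim=0) for c in classes])
        logits = -torch.cdist(target_feats, prototypes)  # closer = higher score
        return {"logits": logits}
```

Averaging over frames is the simplest possible temporal aggregation; the methods implemented here replace it with something more expressive.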
If you find this code helpful, please cite the paper this code is based on:
@inproceedings{perrett2021trx,
  title = {Temporal-Relational CrossTransformers for Few-Shot Action Recognition},
  author = {Perrett, Toby and Masullo, Alessandro and Burghardt, Tilo and Mirmehdi, Majid and Damen, Dima},
  booktitle = {Computer Vision and Pattern Recognition},
  year = {2021}
}
And of course the original papers containing the respective methods.
We based our code on CNAPs (logging, training, evaluation, etc.) and use torch_videovision for video transforms.