We will run B-KinD on your videos by request! Please fill out https://forms.gle/k5gwNxws8xMEEzUe6.
Implementation from the paper:
Jennifer J. Sun, Serim Ryou, Roni Goldshmid, Brandon Weissbourd, John Dabiri, David J. Anderson, Ann Kennedy, Yisong Yue, Pietro Perona, Self-Supervised Keypoint Discovery in Behavioral Videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022
B-KinD discovers keypoints without the need for bounding box or manual keypoint annotations, and works on a range of organisms including mice, humans, flies, jellyfish, and trees. The discovered keypoints and additional shape features are directly applicable to downstream tasks, including behavior classification and pose regression!
Our code currently supports running keypoint discovery on your own videos where the background is relatively stationary and there is no significant occlusion. More features are in progress; let us know in the issues if there's anything else you'd like to see!
Follow these instructions if you would like to quickly try out training B-KinD on CalMS21 and Human 3.6M and use the discovered keypoints in downstream tasks. Please see the instructions below on setting up a new dataset to apply B-KinD to your own videos.
To set up the environment, you can use the provided yml file with conda env create -f bkind_env.yml
Alternatively, here are all the install commands we used (you may need to change the cudatoolkit version depending on your GPU):
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
conda install -c anaconda pyyaml
conda install -c anaconda h5py
pip install piq
pip install opencv-python
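As an optional sanity check (a minimal sketch, not part of the repo), you can confirm that the dependencies import cleanly and that PyTorch sees your GPU before training:

# Optional sanity check: confirm the installed dependencies import and a GPU is visible.
import torch
import torchvision
import cv2
import h5py
import piq
import yaml

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))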
python train_video.py --config config/CalMS21.yaml
Training takes 1~2 days on a single GPU; the keypoints typically converge around epoch 10.
Once you are done training the keypoint discovery model, evaluate it on the CalMS21 task 1 data on the default split. The classification code is provided by [1]: https://gitlab.aicrowd.com/aicrowd/research/mab-e/mab-e-baselines
To evaluate using our keypoints, first extract keypoints for all images in CalMS21 task 1:
python extract_features.py --train_dir [images from train split CalMS21] --test_dir [images from test split CalMS21]
--resume [checkpoint path] --output_dir [output directory to store keypoints]
Convert the extracted keypoints to the npy format for the CalMS21 baseline code:
python convert_discovered_keypoints_to_classifier.py --input_train_file [calms21_task1_train.json from step 1b) above] --input_test_file [calms21_task1_test.json from step 1b) above] --keypoint_dir_train [keypoints_confidence/train extracted in previous step] --keypoint_dir_test [keypoints_confidence/test extracted in previous step]
The data is automatically saved into the data/ directory.
Use the saved npy files in data/ with the CalMS21 classifier code (train_task1.py). Note that the default code from [1] only handles input keypoints; we provide a modified version that reads in keypoints, confidence, and covariance in the classifier/ directory.
python classifier/train_task1.py **
** You need to set train_data_path and test_data_path inside the train_task1.py file to the files generated in the previous step.
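For reference, the edit inside classifier/train_task1.py is just pointing those two variables at the converted files; the filenames below are placeholders for whatever was written to data/ in the previous step:

# Sketch only: set these inside classifier/train_task1.py.
# The exact filenames are placeholders; use the files written to data/ above.
train_data_path = 'data/[converted_train_keypoints].npy'
test_data_path = 'data/[converted_test_keypoints].npy'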
python train_video.py --config config/H36M.yaml
python test_simplehuman36m.py [path_to_simplified_h36m_dataset**] --checkpoint [path_to_model_directory] --batch-size [N] --gpu 0 --nkpts [K] --resume [path_to_model_file]
*Note that [path_to_h36m_dataset] should end with the 'processed' directory.
**[path_to_simplified_h36m_dataset] should end with the 'human_images' directory.
Regression results may vary, since our method does not use any keypoint labels as supervision while training the keypoint discovery model.
Please follow the instructions below if you would like to train B-KinD on your own video dataset! Note that the code currently does not support: tracking for multiple agents with similar appearance, videos with a lot of background motion, and/or videos with a lot of occlusion or self-occlusion.
Note that one important consideration is the frame_gap parameter in the config directory: you want to pick a gap over which the agent has moved a bit, but not too much (usually this is under 1 second; we used a frame gap corresponding to 0.2 s in our work; see the sketch below for deriving it from your video's frame rate). Also adjust nkpts to the number of parts you want to track.
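As a rough guide (a sketch, not part of the repo; the video path is a placeholder), you can derive a frame_gap corresponding to roughly 0.2 s from your video's frame rate with OpenCV:

# Sketch: pick a frame_gap corresponding to ~0.2 s of motion,
# using OpenCV (installed above) to read the video's frame rate.
import cv2

cap = cv2.VideoCapture("my_video.mp4")  # placeholder path
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

frame_gap = max(1, round(0.2 * fps))  # e.g. 6 for a 30 fps video
print("Suggested frame_gap:", frame_gap)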
Extract frames and put all the frames in the "data/custom_dataset" directory. There are many ways to extract frames - for example, using ffmpeg:
ffmpeg -i [input_video_name] [output_directory]/images_%08d.png
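If you prefer to stay in Python, here is a minimal OpenCV-based alternative (a sketch with placeholder paths, not part of the repo):

# Sketch: extract frames with OpenCV instead of ffmpeg (placeholder paths).
import os
import cv2

video_path = "input_video.mp4"
out_dir = "data/custom_dataset/train/video1"
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Zero-padded names keep the frames in ascending order (see note below).
    cv2.imwrite(os.path.join(out_dir, "images_%08d.png" % idx), frame)
    idx += 1
cap.release()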
Directory structure should follow the format below:
data/custom_dataset
  |---train
    |---video1
      |---image0.png
      |---image1.png
      |--- .
      |--- .
    |---video2
      |--- .
      |--- .
  |---val
    |---videoN
*Image file names should be in ascending order for sampling pairs of images sequentially.
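Because image pairs are sampled sequentially, it can be worth checking that the frame filenames will sort chronologically (a quick sketch, assuming the directory layout above):

# Sketch: with zero-padded names (e.g. images_00000001.png), lexicographic sort
# matches frame order; mixed filename lengths usually mean missing padding.
import glob
import os

for split in ("train", "val"):
    for video_dir in sorted(glob.glob(os.path.join("data/custom_dataset", split, "*"))):
        names = sorted(os.listdir(video_dir))
        if names and len(set(len(n) for n in names)) > 1:
            print("Check zero-padding in", video_dir, "- example names:", names[:3])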
Set up the configuration of your data in config/custom_dataset.yaml
Run the command
python train_video.py --config config/custom_dataset.yaml
Use the --visualize
flag to visualize examples during training. This can be helpful to see when the keypoints have converged.
To extract additional features from the discovered heatmaps, run the command
python extract_features.py --train_dir [images from train split] --test_dir [images from test split]
--resume [checkpoint path] --output_dir [output directory to store keypoints] --imsize [training image size] --nkpts [number of discovered keypoints]
Please refer to our paper for details and consider citing it if you find the code useful:
@article{bkind2021,
title={Self-Supervised Keypoint Discovery in Behavioral Videos},
author={Sun, Jennifer J and Ryou, Serim and Goldshmid, Roni and Weissbourd, Brandon and Dabiri, John and Anderson, David J and Kennedy, Ann and Yue, Yisong and Perona, Pietro},
journal={arXiv preprint arXiv:2112.05121},
year={2021}
}
Our code is available under the Apache License 2.0.
[1] Sun et al., The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions. NeurIPS 2021.
[2] Zhang et al., Unsupervised discovery of object landmarks as structural representations. CVPR 2018.