This repository is the official PyTorch implementation of the paper "Learnable Triangulation of Human Pose" (ICCV 2019, oral). Here we tackle the problem of 3D human pose estimation from multiple cameras. We present two novel methods, algebraic and volumetric learnable triangulation, that outperform the previous state of the art.
If you find a bug, have a question, or know how to improve the code, please open an issue!
:arrow_forward: ICCV 2019 talk
This project doesn't have any special or difficult-to-install dependencies. All installation can be done with:

```bash
pip install -r requirements.txt
```
Sorry, only Human3.6M dataset training/evaluation is available right now. Unfortunately, we cannot add CMU Panoptic support.
The 2D backbone checkpoint is expected at `./data/pretrained/human36m/pose_resnet_4.5_pixels_human36m.pth` (a ResNet-152 trained on the COCO dataset and finetuned jointly on MPII and Human3.6M).

In this section we collect pretrained models and configs. All pretrained weights and precalculated 3D skeletons can be downloaded at once from here and placed into `./data/pretrained`, so that the eval configs work out of the box (without additional setting of paths). Alternatively, the table below provides separate links to those files.
Human3.6M:
Model | Train config | Eval config | Weights | Precalculated results | MPJPE (relative to pelvis), mm |
---|---|---|---|---|---|
Algebraic | train/human36m_alg.yaml | eval/human36m_alg.yaml | link | train, val | 22.5 |
Volumetric (softmax) | train/human36m_vol_softmax.yaml | eval/human36m_vol_softmax.yaml | link | — | 20.4 |
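If you want to inspect the precalculated results, a minimal sketch is below. It assumes the downloaded files are ordinary Python pickles with per-frame 3D keypoints; the filename is hypothetical, and the repo's dataset code is the authoritative reference for the exact format.

```python
import pickle

# Hypothetical filename -- use whichever "Precalculated results" file you
# downloaded from the table above. Format is assumed to be a plain pickle.
with open("./data/pretrained/human36m/results_val.pkl", "rb") as f:
    results = pickle.load(f)

# Inspect the structure before relying on any particular layout
print(type(results))
```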
Every experiment is defined by a config file. Configs with the experiments from the paper can be found in the `./experiments` directory (see the model zoo above).
To train a Volumetric model with softmax aggregation using 1 GPU, run:

```bash
python3 train.py \
  --config experiments/human36m/train/human36m_vol_softmax.yaml \
  --logdir ./logs
```
The training will start with the config file specified by `--config`, and logs (including tensorboard files) will be stored in `--logdir`.
Multi-GPU training is implemented with PyTorch's DistributedDataParallel. It can be used for both single-machine and multi-machine (cluster) training. To spawn the processes, use PyTorch's launch utility.
To train a Volumetric model with softmax aggregation using 2 GPUs on a single machine, run:

```bash
python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=2345 \
  train.py \
  --config experiments/human36m/train/human36m_vol_softmax.yaml \
  --logdir ./logs
```
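For intuition, the distributed setup that the launch utility expects typically follows the standard PyTorch pattern sketched below. This is a generic sketch, not the repo's actual code; in particular, how the local rank reaches the process (environment variable vs. `--local_rank` argument) depends on the launcher version.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model):
    # torch.distributed.launch provides the process's GPU index either via
    # the LOCAL_RANK environment variable (--use_env) or a --local_rank arg
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # NCCL backend for GPU training
    model = model.cuda(local_rank)
    # DDP averages gradients across all processes on every backward pass
    return DDP(model, device_ids=[local_rank])
```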
To watch your experiments' progress, run tensorboard:

```bash
tensorboard --logdir ./logs
```
After training, you can evaluate the model. Inside the same config file, add the path to the learned weights (they are dumped to the `--logdir` directory during training):

```yaml
model:
  init_weights: true
  checkpoint: {PATH_TO_WEIGHTS}
```
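For intuition, loading such a checkpoint usually boils down to the standard PyTorch pattern below. This is a generic sketch, not the repo's loading code, and the `"model_state"` key is a guess.

```python
import torch

def load_checkpoint(model, checkpoint_path):
    # Map to CPU first so the file loads regardless of the GPU setup
    state = torch.load(checkpoint_path, map_location="cpu")
    # Some checkpoints nest the weights under a key; "model_state" is an
    # assumption -- inspect state.keys() if load_state_dict() complains
    if isinstance(state, dict) and "model_state" in state:
        state = state["model_state"]
    model.load_state_dict(state)
    return model
```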
Also, you can change other config parameters like `retain_every_n_frames_test`.
Run:

```bash
python3 train.py \
  --eval --eval_dataset val \
  --config experiments/human36m/eval/human36m_vol_softmax.yaml \
  --logdir ./logs
```
Argument `--eval_dataset` can be `val` or `train`. Results can be seen in the `logs` directory or in tensorboard.
MPJPE relative to pelvis:
Method | MPJPE (averaged across all actions), mm |
---|---|
Multi-View Martinez [4] | 57.0 |
Pavlakos et al. [8] | 56.9 |
Tome et al. [4] | 52.8 |
Kadkhodamohammadi & Padoy [5] | 49.1 |
Qiu et al. [9] | 26.2 |
RANSAC (our implementation) | 27.4 |
Ours, algebraic | 22.4 |
Ours, volumetric | 20.5 |
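For reference, the MPJPE metric used throughout these tables measures the mean Euclidean distance between predicted and ground-truth joints; the relative variant first centers both skeletons on the pelvis. A minimal NumPy sketch (the pelvis index is an assumption; the repo's evaluation code is authoritative):

```python
import numpy as np

def mpjpe_relative_to_pelvis(pred, gt, pelvis_idx=6):
    """Mean per-joint position error after root (pelvis) alignment.

    pred, gt:   (N, J, 3) predicted / ground-truth 3D joints, in mm
    pelvis_idx: dataset-specific joint index; 6 is an assumption here,
                so check the repo's joint ordering before trusting it
    """
    pred = pred - pred[:, pelvis_idx:pelvis_idx + 1]  # center on pelvis
    gt = gt - gt[:, pelvis_idx:pelvis_idx + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()
```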
MPJPE absolute (scenes with invalid ground-truth annotations are excluded):
Method | MPJPE (averaged across all actions), mm |
---|---|
RANSAC (our implementation) | 22.8 |
Ours, algebraic | 19.2 |
Ours, volumetric | 17.7 |
MPJPE relative to pelvis (single-view methods):
Method | MPJPE (averaged across all actions), mm |
---|---|
Martinez et al. [7] | 62.9 |
Sun et al. [6] | 49.6 |
Ours, volumetric single view | 49.9 |
MPJPE relative to pelvis (4 cameras):

Method | MPJPE, mm |
---|---|
RANSAC (our implementation) | 39.5 |
Ours, algebraic | 21.3 |
Ours, volumetric | 13.7 |
We present two novel methods of learnable triangulation: algebraic and volumetric.
Our first method is based on algebraic triangulation. It is similar to previous approaches, but differs in two critical aspects: the triangulation step is made fully differentiable, so the 2D backbone can be trained end-to-end from the 3D loss, and the contribution of each camera view is weighted by a learnable confidence (see the sketch below).
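To make the mechanics concrete, here is a minimal PyTorch sketch of differentiable weighted DLT triangulation for a single joint. Names and shapes are ours, and the repo's implementation is the authoritative version; the key point is that the SVD is differentiable, so gradients flow back to both the 2D detections and the confidence weights.

```python
import torch

def weighted_dlt_triangulate(proj_mats, points_2d, confidences):
    """Triangulate one 3D point from C views via weighted DLT.

    proj_mats:   (C, 3, 4) camera projection matrices
    points_2d:   (C, 2) detected 2D joint positions
    confidences: (C,) per-view weights (learnable in the paper's model)
    """
    u, v = points_2d[:, 0:1], points_2d[:, 1:2]
    # Two linear constraints per view: u * P_3 - P_1 = 0, v * P_3 - P_2 = 0
    rows = torch.cat([
        u * proj_mats[:, 2] - proj_mats[:, 0],
        v * proj_mats[:, 2] - proj_mats[:, 1],
    ], dim=0)                                   # (2C, 4)
    A = confidences.repeat(2).unsqueeze(1) * rows
    # Homogeneous least-squares solution: the right singular vector that
    # corresponds to the smallest singular value (SVD is differentiable)
    _, _, vh = torch.linalg.svd(A)
    X = vh[-1]
    return X[:3] / X[3]                         # dehomogenize
```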
For the most popular Human3.6M dataset, this method alone already dramatically reduces the error by 2.2 times compared to the previous state of the art.
In the volumetric triangulation model, intermediate 2D feature maps are densely unprojected into a volumetric cube and then processed with a 3D convolutional neural network. The unprojection operation allows dense aggregation from multiple views, and the 3D convolutional network is able to model an implicit human pose prior.
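A minimal PyTorch sketch of the unprojection step is below: each voxel center is projected into every camera and the 2D feature maps are bilinearly sampled at those pixels. Names and shapes are ours; view aggregation (e.g. the softmax) and the 3D CNN are omitted.

```python
import torch
import torch.nn.functional as F

def unproject_features(feat_2d, proj_mats, grid_coords):
    """Sample 2D backbone features at every voxel center, per camera.

    feat_2d:     (C, F, H, W) 2D feature maps, one per camera
    proj_mats:   (C, 3, 4) projection matrices mapping world -> pixels
    grid_coords: (N, 3) world coordinates of the N voxel centers
    Returns (C, F, N): per-view features to aggregate across cameras.
    """
    ones = torch.ones_like(grid_coords[:, :1])
    homog = torch.cat([grid_coords, ones], dim=1)           # (N, 4)
    pix = torch.einsum('cij,nj->cni', proj_mats, homog)     # (C, N, 3)
    # Perspective divide; clamping crudely guards against points behind
    # the camera (they land outside the image and sample to zeros)
    pix = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)
    H, W = feat_2d.shape[2:]
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects
    norm = torch.stack([pix[..., 0] / (W - 1),
                        pix[..., 1] / (H - 1)], dim=-1) * 2 - 1
    samples = F.grid_sample(feat_2d, norm.unsqueeze(1),
                            align_corners=True)             # (C, F, 1, N)
    return samples.squeeze(2)
```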
Volumetric triangulation further improves accuracy, drastically reducing the previous state-of-the-art error by 2.4 times. Even compared to the best concurrently developed method by the MSRA group, our method still offers a significantly lower error of 21 mm.
```bibtex
@inproceedings{iskakov2019learnable,
  title     = {Learnable Triangulation of Human Pose},
  author    = {Iskakov, Karim and Burkov, Egor and Lempitsky, Victor and Malkov, Yury},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year      = {2019}
}
```