microsoft / MeshTransformer

Research code for CVPR 2021 paper "End-to-End Human Pose and Mesh Reconstruction with Transformers"
https://arxiv.org/abs/2012.09760
MIT License

the estimation result on 3DPW using ResNet50 backbone #33

Closed by tinatiansjz 3 years ago

tinatiansjz commented 3 years ago

Hi, I'm curious about the quantitative performance of METRO with a ResNet50 backbone on the 3DPW dataset, since the official repo doesn't provide pre-trained models with ResNet50. I'd be grateful for any advice.

tinatiansjz commented 3 years ago

Also, the joints used for evaluation are the ones predicted directly by the network, rather than the ones regressed from the mesh using the pre-defined regression matrix. Have I got this right?

kevinlin311tw commented 3 years ago

Q&A1: Unfortunately, we didn't try the ResNet50 backbone on the 3DPW dataset. In our early experiments, we used Human3.6M for the ablation study of different backbones. We found that HRNet gives better results, so we used HRNet for the rest of the experiments.

Q&A2: Yes. We mainly evaluate the 3D joints that are regressed from the 3D mesh via the pre-defined regression matrix. This is because we want to understand the 3D pose of the estimated 3D mesh. In our early explorations, we also tried evaluating the 3D joints that are directly predicted by the network. It gives very similar results (about 0.1 MPJPE improvement).
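For readers following along, here is a minimal NumPy sketch of the evaluation described above: joints are obtained as a linear combination of mesh vertices through a regression matrix, and MPJPE is the mean Euclidean distance between predicted and ground-truth joints. The function names, the toy sizes, and the random regressor are assumptions made for illustration; this is not the repository's code, and it omits any root alignment an actual evaluation script may apply.

```python
import numpy as np

def regress_joints(vertices, joint_regressor):
    """Map mesh vertices (V, 3) to joints (J, 3) with a (J, V) regression matrix."""
    return joint_regressor @ vertices

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())

rng = np.random.default_rng(0)
num_vertices, num_joints = 100, 14  # toy sizes, not the real mesh resolution

# Synthetic "ground truth" mesh and a slightly perturbed "prediction".
gt_vertices = rng.normal(size=(num_vertices, 3))
pred_vertices = gt_vertices + 0.01 * rng.normal(size=(num_vertices, 3))

# Hypothetical regressor: each joint is a convex combination of vertices.
weights = rng.random((num_joints, num_vertices))
joint_regressor = weights / weights.sum(axis=1, keepdims=True)

gt_joints = regress_joints(gt_vertices, joint_regressor)
pred_joints = regress_joints(pred_vertices, joint_regressor)
error = mpjpe(pred_joints, gt_joints)
print(f"MPJPE on regressed joints: {error:.4f}")
```

Evaluating the directly predicted joints instead would only change what is passed as `pred_joints`; the metric itself stays the same.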

kevinlin311tw commented 3 years ago

Just want to add more comments about Q2.

Since the 3D joints we use are computed from the 3D mesh, you may wonder what happens if we use the transformer to predict only the 3D vertices. In our early explorations, we tried exactly that, but the training did not converge well.

To make it converge, we found that we need both joint and vertex queries, and we need to train the transformer to predict both joints and vertices. We think this is probably because, with this approach, the self-attention mechanism can directly learn the non-local interactions between joints and vertices, which leads to further improvements.
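As a rough illustration of the idea, the sketch below concatenates joint queries and vertex queries into one token sequence and runs a single self-attention step over it, so every joint token can attend to every vertex token and vice versa. Everything here is an assumption made for the example (toy sizes, random untrained weights, a single head in plain NumPy); it is not the METRO architecture or the repository's code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (N, D) token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)  # (N, N): each row sums to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
num_joints, num_vertices, dim = 4, 10, 8  # toy sizes for illustration

# Hypothetical query embeddings; in a real model these would be learned or
# built from image features.
joint_queries = rng.normal(size=(num_joints, dim))
vertex_queries = rng.normal(size=(num_vertices, dim))

# One sequence holding both kinds of query.
tokens = np.concatenate([joint_queries, vertex_queries], axis=0)

# Random, untrained projection weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

out, attn = self_attention(tokens, w_q, w_k, w_v)
joint_out, vertex_out = out[:num_joints], out[num_joints:]

# Off-diagonal block: how much each joint token attends to each vertex token.
joint_to_vertex = attn[:num_joints, num_joints:]
print(joint_to_vertex.shape)
```

With vertex queries alone, the attention matrix would contain only the vertex-to-vertex block; adding joint queries introduces the joint-to-vertex and vertex-to-joint blocks that the comment above refers to.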

tinatiansjz commented 3 years ago

Thank you for your clear explanation! My confusion has been dispelled :)