Closed by tinatiansjz 3 years ago
And the joints for estimation are predicted by the network, instead of the regressed ones obtained by using the pre-defined regression matrix. Have I got this right?
Q&A1: Unfortunately, we didn't try the ResNet50 backbone on the 3DPW dataset. In our early experiments, we used Human3.6M for the ablation study of different backbones. We found that HRNet gives better results, so we used HRNet for the rest of the experiments.
Q&A2: Yes. We mainly evaluate the 3D joints that are regressed from the 3D mesh via the pre-defined regression matrix. This is because we want to understand the 3D pose of the estimated 3D mesh. In fact, in our early explorations, we also tried evaluating the 3D joints that are directly predicted by the network. It gives very similar results (about 0.1 MPJPE improvement).
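For readers who want to see what "regressed from the 3D mesh via the pre-defined regression matrix" means in practice, here is a minimal NumPy sketch. It is not the code from the official repo. The function names, the random regression matrix, the 6890-vertex / 14-joint sizes, and the root-aligned MPJPE definition are all assumptions made for illustration; a real evaluation would load the regression matrix shipped with the body model and follow the dataset's protocol.

```python
import numpy as np


def regress_joints(vertices, joint_regressor):
    """Compute joints as a linear combination of mesh vertices.

    vertices: (V, 3) array of mesh vertex positions.
    joint_regressor: (J, V) pre-defined regression matrix.
    Returns a (J, 3) array of joint positions.
    """
    return joint_regressor @ vertices


def mpjpe(pred_joints, gt_joints, root_index=0):
    """Mean per-joint position error after aligning both poses at a root joint.

    Root alignment is an assumption here; check the protocol of the dataset used.
    """
    pred = pred_joints - pred_joints[root_index]
    gt = gt_joints - gt_joints[root_index]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


# Toy data (hypothetical): a random row-normalized matrix stands in for the
# real pre-defined regressor, and random points stand in for mesh vertices.
rng = np.random.default_rng(0)
num_vertices, num_joints = 6890, 14
weights = rng.random((num_joints, num_vertices))
joint_regressor = weights / weights.sum(axis=1, keepdims=True)

gt_vertices = rng.normal(size=(num_vertices, 3))
pred_vertices = gt_vertices + 0.01 * rng.normal(size=(num_vertices, 3))

gt_joints = regress_joints(gt_vertices, joint_regressor)
pred_joints = regress_joints(pred_vertices, joint_regressor)
error = mpjpe(pred_joints, gt_joints)
```

To evaluate joints predicted directly by a network instead, one would pass those joints to `mpjpe` in place of `pred_joints` and skip the regression step.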
Just want to add more comments about Q2.
Since the 3D joints we used are computed from the 3D mesh, you may wonder what happens if we use a transformer to predict only the 3D vertices. In fact, in our early explorations, we tried exactly that, but the training did not converge well.
To make it converge, we found that we need both joint and vertex queries, and that we need to train our transformer to predict both joints and vertices. We think this is probably because, with this approach, we can better leverage the self-attention mechanism to directly learn non-local interactions between them, which leads to further improvements.
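To illustrate the idea of joint and vertex queries sharing one self-attention layer, here is a small NumPy sketch. It is not the architecture from the repo: the single-head attention, the random weights, and the sizes (14 joints, 431 vertices, 32 dimensions) are assumptions chosen only to show that, once both kinds of queries sit in the same token sequence, the attention matrix contains joint-to-vertex and vertex-to-joint blocks.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens.

    tokens: (N, D) array. Returns the (N, D) output and the (N, N) attention map.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
num_joints, num_vertices, dim = 14, 431, 32
joint_queries = rng.normal(size=(num_joints, dim))
vertex_queries = rng.normal(size=(num_vertices, dim))

# Put joint and vertex queries in one sequence so every token can attend to every other.
tokens = np.concatenate([joint_queries, vertex_queries], axis=0)
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

out, attn = self_attention(tokens, w_q, w_k, w_v)
joint_out, vertex_out = out[:num_joints], out[num_joints:]

# Block of the attention map where joint tokens attend to vertex tokens.
joint_to_vertex = attn[:num_joints, num_joints:]
```

With vertex queries only, the `joint_to_vertex` block would not exist, so there would be no joint-vertex interactions for the layer to learn.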
Thank you for your clear explanation! My confusion has been dispelled :)
Hi, I'm curious about the quantitative performance of METRO with the ResNet50 backbone on the 3DPW dataset, since the official repo doesn't provide pre-trained models with ResNet50. I'd be grateful for any advice.