microsoft / multiview-human-pose-estimation-pytorch

This is an official PyTorch implementation of "Cross View Fusion for 3D Human Pose Estimation, ICCV 2019".
MIT License

A problem in "Generalization to New Camera Setups" #37

Open lsvery666 opened 3 years ago

lsvery666 commented 3 years ago

In Section 7.5, "Generalization to New Camera Setups", you present three comparable experiments. In the first, RPSM is applied directly to the 2D estimates to obtain the 3D pose, and the error is 109 mm. In the second, the 2D estimator's outputs are used as pseudo labels to train the network without the fusion layer, and the error drops to 61 mm. In the third, the fusion layer is added and the error falls to 43 mm. I'm wondering why the latter two experiments perform better than the first. The latter two merely train the network on pseudo labels treated as ground truth, while the first uses those labels directly. A trained network should produce outputs close to its training targets, but it should never exceed them in quality. So how can the first experiment, which uses the targets directly, perform worse than experiments trained on outputs that only approximate them?

CHUNYUWANG commented 3 years ago

The first experiment has a large error because the 2D pose estimates are not accurate (the model was trained on MPII rather than H36M). Note that we don't use ground truth in this experiment. The second experiment reduces the error notably because the model sees images from H36M (it uses multi-view images to obtain pseudo labels). As a result, the 2D pose estimates become more accurate.
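To make the pseudo-label step concrete, below is a minimal sketch (not the repo's actual code) of how multi-view 2D estimates can be lifted to 3D by linear (DLT) triangulation and then reprojected into each view; the reprojections would serve as cleaner pseudo labels for retraining the 2D estimator. The camera matrices and point values here are illustrative assumptions.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: 2D detections (pixels) of the same joint in each view.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point (homogeneous) is the right null vector of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def reproject(P, X):
    """Project a 3D point into a view; the result can act as a pseudo label."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy two-camera setup (assumed values, for illustration only).
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])          # reference camera
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # translated camera

X_true = np.array([0.2, 0.1, 5.0])   # a "joint" in 3D
x1 = reproject(P1, X_true)           # noise-free 2D detections
x2 = reproject(P2, X_true)
X_rec = triangulate(P1, P2, x1, x2)  # recovered 3D joint
```

With noisy 2D detections from several views, the triangulated joint averages out per-view errors, which is why reprojected pseudo labels can be more accurate than the raw 2D estimates that produced them.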