microsoft / multiview-human-pose-estimation-pytorch

This is an official PyTorch implementation of "Cross View Fusion for 3D Human Pose Estimation, ICCV 2019".
MIT License

A problem in "Generalization to New Camera Setups" #37

Open lsvery666 opened 3 years ago

lsvery666 commented 3 years ago

In Section 7.5, "Generalization to New Camera Setups", you present three comparable experiments. In the first, RPSM is applied directly to the 2D estimates to obtain the 3D pose, and the error is 109 mm. In the second, the 2D estimator's outputs are used as pseudo labels to train the network without the fusion layer, and the error drops to 61 mm. In the third, the fusion layer is added and the error falls to 43 mm. I'm wondering why the latter two experiments perform better than the first. The latter two merely train the network on pseudo labels treated as ground truth, while the first uses those labels directly. A trained network should produce outputs close to its training targets, but it should never exceed them in quality. So how can the first experiment, which uses the targets directly, perform worse than experiments trained on outputs that only approximate them?

CHUNYUWANG commented 3 years ago

The first experiment has a large error because the 2D pose estimates are not accurate (the model was trained on MPII rather than H36M). Note that we don't use ground truth in this experiment. The second experiment reduces the error notably because the model sees images from H36M (it uses multi-view images to obtain pseudo labels). As a result, the 2D pose estimates become more accurate.
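To make the pseudo-label step concrete, below is a minimal sketch (not the repo's actual code) of how multi-view 2D estimates can be lifted to 3D by linear (DLT) triangulation and then reprojected into each view; the reprojections would serve as cleaner pseudo labels for retraining the 2D estimator. The camera matrices and point values here are illustrative assumptions.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: 2D detections (pixels) of the same joint in each view.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point (homogeneous) is the right null vector of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def reproject(P, X):
    """Project a 3D point into a view; the result can act as a pseudo label."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy two-camera setup (assumed values, for illustration only).
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])          # reference camera
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # translated camera

X_true = np.array([0.2, 0.1, 5.0])   # a "joint" in 3D
x1 = reproject(P1, X_true)           # noise-free 2D detections
x2 = reproject(P2, X_true)
X_rec = triangulate(P1, P2, x1, x2)  # recovered 3D joint
```

With noisy 2D detections from several views, the triangulated joint averages out per-view errors, which is why reprojected pseudo labels can be more accurate than the raw 2D estimates that produced them.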