princeton-vl / pytorch_stacked_hourglass

Pytorch implementation of the ECCV 2016 paper "Stacked Hourglass Networks for Human Pose Estimation"
BSD 3-Clause "New" or "Revised" License

evaluation #13

Closed zenghy96 closed 4 years ago

zenghy96 commented 4 years ago

To generate heatmaps at low resolution, the ground-truth keypoints are transformed (via an affine transformation) into heatmap coordinates (Pts). But if we transform the Pts back to the original resolution, there are deviations (about 5-8 pixels) between the ground truth and the Pts. The network output is learned from the Pts, which already differ from the ground truth, so does this influence the precision of the final result? Waiting for your reply! THX!
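
To illustrate what I mean, here is a rough sketch (not the repo's actual transform code, and the crop geometry is a simplified assumption) of mapping a keypoint into 64x64 heatmap coordinates, quantizing it to a pixel, and mapping it back:

```python
import numpy as np

def round_trip_error(pt, center, scale, res=64):
    # crude stand-in for the crop/affine transform: a square crop of side
    # scale*200 centred at `center`, resized to a res x res heatmap
    side = scale * 200.0
    ul = np.asarray(center, dtype=float) - side / 2           # crop upper-left corner
    pt_hm = (np.asarray(pt, dtype=float) - ul) * res / side   # into heatmap coords
    pt_hm_q = np.round(pt_hm)                                  # peak lands on an integer pixel
    pt_back = pt_hm_q * side / res + ul                        # back to original coords
    return np.linalg.norm(pt_back - np.asarray(pt, dtype=float))

# a person occupying ~600 px in the original image (scale = 3.0):
# quantization to a 64x64 grid alone costs a few pixels
print(round_trip_error(pt=[412.3, 317.8], center=[400.0, 300.0], scale=3.0))
```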

crockwell commented 4 years ago

I believe you are essentially correct: the model is trained on heatmaps (generated from ground truth) at lower resolution, and this probably causes a very slight loss of keypoint prediction precision. The hourglass structure works at this resolution for input and output as well, and could be modified to use higher resolution at the cost of (a lot of) memory. Considering the model is pretty memory intensive, and adding more "stacks" helps results, I think the small sacrifice in precision seemed justified given other concerns and the dataset.
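
For reference, the training target is roughly a Gaussian placed at the (rounded) heatmap-space keypoint at the 64x64 output resolution; a minimal sketch below (the repo's own heatmap generation may differ in sigma and other details):

```python
import numpy as np

def make_target_heatmap(pt_hm, res=64, sigma=1.0):
    # dense Gaussian centred on the rounded heatmap-space keypoint
    xs, ys = np.meshgrid(np.arange(res), np.arange(res))
    x, y = round(pt_hm[0]), round(pt_hm[1])
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)).astype(np.float32)

target = make_target_heatmap((33.3, 33.9))   # one 64x64 training target channel
print(target.shape, target.argmax())
```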

I personally did not experiment with higher resolution heatmap training, but it is possible this could make a small difference in results. Ground truths should not be biased due to this downsampling however, so I would assume the possible gains are pretty small, at least in the MPII setting. I'm curious to hear what sort of results you could get running this ablation!