mks0601 / V2V-PoseNet_RELEASE

Official Torch7 implementation of "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map", CVPR 2018
https://arxiv.org/abs/1711.07399
MIT License

Change output heatmap's resolution up to 88x88x88 #33

Closed (dragonbook closed this issue 5 years ago)

dragonbook commented 5 years ago

Hi, I want to increase the network's output resolution to 88x88x88 (that is, double the current heatmap size) to improve estimation precision. Since your current network already works well, I simply inserted one more decoder block (with an upsampling layer) after the original encoder-decoder block to double its output (which is 44x44x44). I also tried adding a longer skip/residual connection at the original scale (88x88x88). But in my experiments these variants do not seem to work as well as your original network (although some of them do work).

I wonder how you designed your current network, beyond the common practices such as residual blocks and U-Net-like skip connections. For example, you use a U-Net-style encoder-decoder sub-block in the middle of the architecture, after one basic conv layer, one pooling layer, and some residual blocks. What were your considerations?

Besides, did you consider the receptive field of each feature map cell (in order to capture a larger 3D context) when you designed the network? Did you try any experiments or network designs with an 88x88x88 output resolution? Could you share some of your experience or give me some suggestions?

Thanks!

mks0601 commented 5 years ago

Hi dragonbook,

I tried your case, and it does not improve performance much while costing much more computation. When I designed my model, I considered the receptive field size and the conventional U-Net structure.
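For readers who want to check receptive field size themselves, here is a minimal sketch of the standard per-axis calculation for a stack of convolution and pooling layers. The layer list is hypothetical and is not the actual V2V-PoseNet configuration; the function name `receptive_field` is also just an illustration. It only covers the downsampling path (no upsampling layers).

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples in forward order.
    """
    rf = 1    # receptive field of a single input voxel
    jump = 1  # distance in input voxels between adjacent output cells
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stack (NOT the real V2V-PoseNet layers):
# 7-wide conv, 2x pool, two 3-wide convs, 2x pool, one 3-wide conv.
example_layers = [(7, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1)]
print(receptive_field(example_layers))
```

Comparing this number with the input volume size (88 voxels per axis here) gives a rough idea of how much 3D context each feature map cell can see.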

I think you have to set the sigma value used to generate the GT heatmap larger than the original one. Otherwise, the blob on the GT heatmap would be too small (because of the enlarged output heatmap size), so the model would have difficulty learning to localize hand keypoints.
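The sigma point above can be illustrated with a small NumPy sketch. It builds a 3D Gaussian GT heatmap at 44x44x44 and at 88x88x88 and compares how much of the volume the blob covers. The helper names and the value `base_sigma = 1.7` are assumptions chosen for illustration, not values taken from this thread; scaling sigma by the same factor as the resolution is one simple choice, not necessarily the best one.

```python
import numpy as np


def gaussian_heatmap_3d(center, size, sigma):
    """Return a (size, size, size) volume with a Gaussian peak at `center` (voxel coords)."""
    coords = np.arange(size, dtype=np.float64)
    z, y, x = np.meshgrid(coords, coords, coords, indexing="ij")
    cz, cy, cx = center
    d2 = (z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def blob_fraction(heatmap, threshold=0.5):
    """Fraction of voxels whose value is at least `threshold`."""
    return float((heatmap >= threshold).mean())


base_size = 44
base_sigma = 1.7  # illustrative assumption
scale = 2         # 44 -> 88 voxels per axis
big_size = base_size * scale

small = gaussian_heatmap_3d((22, 22, 22), base_size, base_sigma)
big_same_sigma = gaussian_heatmap_3d((44, 44, 44), big_size, base_sigma)
big_scaled_sigma = gaussian_heatmap_3d((44, 44, 44), big_size, base_sigma * scale)

print("44^3, sigma:      ", blob_fraction(small))
print("88^3, same sigma: ", blob_fraction(big_same_sigma))
print("88^3, sigma * 2:  ", blob_fraction(big_scaled_sigma))
```

With the same sigma, the blob occupies a much smaller share of the 88x88x88 volume, so the target is sparser. Scaling sigma with the resolution keeps the share roughly the same as in the 44x44x44 case.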