mks0601 / 3DMPPE_POSENET_RELEASE

Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019
MIT License

General question about data #105

Open ttdd11 opened 2 years ago

ttdd11 commented 2 years ago

Amazing repo and very interesting work. I have some questions about the data formatting to introduce 3D data. This may sound very simple so I apologize in advance.

When building a data batch, joint pixel-space (x, y) coordinates are taken along with each joint's root-relative depth (location - root) in camera coordinates, so for the root, z = 0. I'm assuming this z is in mm.

These data are then mapped to the output space by normalizing the depth dimension to 0-1 (with 0 corresponding to -1000 mm and 1 to +1000 mm) and then multiplying by the heatmap depth resolution, which is 64.
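If I'm reading it right, the forward mapping would look roughly like this (a minimal sketch; the constant names `DEPTH_DIM` and `BBOX_3D_DEPTH` are just my placeholders, not the repo's exact variables):

```python
import numpy as np

DEPTH_DIM = 64          # depth resolution of the 3D heatmap
BBOX_3D_DEPTH = 2000.0  # assumed depth range in mm (-1000 .. +1000)

def root_relative_z_to_heatmap(joint_cam_z, root_cam_z):
    """Map a joint's camera-space depth (mm) to a heatmap depth index."""
    z_rel = joint_cam_z - root_cam_z                       # root-relative depth in mm; the root maps to 0
    z_norm = (z_rel / (BBOX_3D_DEPTH / 2.0) + 1.0) / 2.0   # normalize -1000..+1000 mm -> 0..1
    return z_norm * DEPTH_DIM                              # scale to the heatmap depth resolution (0..64)
```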

So on the forward pass during inference, the depth values of the output heatmaps are converted back to mm using the inverse of that mapping, and then corrected according to the true root depth (calculated using RootNet).
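In pseudocode, the inverse at inference would then be something like this (again, placeholder names, with `root_depth_mm` coming from RootNet):

```python
# Same placeholder constants as in the sketch above.
DEPTH_DIM = 64
BBOX_3D_DEPTH = 2000.0  # assumed depth range in mm (-1000 .. +1000)

def heatmap_z_to_absolute_depth(z_heatmap, root_depth_mm):
    """Invert the training-time depth mapping and add the RootNet depth estimate."""
    z_norm = z_heatmap / DEPTH_DIM                         # heatmap index -> 0..1
    z_rel = (z_norm * 2.0 - 1.0) * (BBOX_3D_DEPTH / 2.0)   # 0..1 -> -1000..+1000 mm (root-relative)
    return z_rel + root_depth_mm                           # absolute camera depth in mm
```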

If this is all true (which it may not be, please correct me where wrong), how is the depth handled for smaller versus larger images/people? Is this implicit in the training?

Thank you again for your amazing contribution.

mks0601 commented 2 years ago

Your descriptions are all correct. As all persons are cropped and resized using their bounding boxes, they have almost the same scale in (x, y) pixel space. On the other hand, they have different depths, as depth is kept as a discretized millimeter value.
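In other words, each person patch goes through an affine crop and resize like the sketch below (the 256x256 input size is only an example value, not necessarily the repo's configuration):

```python
import numpy as np
import cv2

INPUT_W, INPUT_H = 256, 256  # illustrative network input size

def crop_and_resize(img, bbox):
    """bbox = (x_min, y_min, width, height) in original image pixels."""
    x, y, w, h = bbox
    src = np.float32([[x, y], [x + w, y], [x, y + h]])      # bbox corners in the original image
    dst = np.float32([[0, 0], [INPUT_W, 0], [0, INPUT_H]])  # corners of the network input patch
    trans = cv2.getAffineTransform(src, dst)                # 2x3 affine: bbox -> input space
    patch = cv2.warpAffine(img, trans, (INPUT_W, INPUT_H))  # every person ends up at the same pixel scale
    return patch, trans
```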

In the inference stage, we apply an inverse affine transformation from the cropped and resized human bounding box back to the original image space. Here, if a small and a large person have similar camera depth, they might have different (x, y) scales. Then, we back-project (x, y) pixels + z millimeters to (x, y, z) millimeters.
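The back-projection is the standard pinhole relation. A minimal sketch, assuming intrinsics f = (fx, fy) and principal point c = (cx, cy), and coordinates already mapped back to the original image space:

```python
import numpy as np

def pixel_to_camera(coords_img, f, c):
    """coords_img: (N, 3) array of (x_pixel, y_pixel, z_millimeter)."""
    x = (coords_img[:, 0] - c[0]) / f[0] * coords_img[:, 2]  # X in mm
    y = (coords_img[:, 1] - c[1]) / f[1] * coords_img[:, 2]  # Y in mm
    z = coords_img[:, 2]                                     # Z is already in mm
    return np.stack((x, y, z), axis=1)
```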