mks0601 / 3DMPPE_POSENET_RELEASE

Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019
MIT License

Some questions about the process of 'joint_img[:, 2]' #30

Closed karenyun closed 4 years ago

karenyun commented 4 years ago

Hi, during data augmentation you normalize the depth to -1~1 by dividing by half of bbox_3d_shape[0], as in `joint_img[i, 2] /= (cfg.bbox_3d_shape[0]/2.)  # expect depth lies in -bbox_3d_shape[0]/2 ~ bbox_3d_shape[0]/2 -> -1.0 ~ 1.0`. Does this mean the depths in camera space are in the range [-1000, 1000]? And why?

Then you do `joint_img[:, 2] = joint_img[:, 2] * cfg.depth_dim`; why is this step needed?

mks0601 commented 4 years ago

To define the voxel space, each of the x, y, and z axes needs a range. For the x and y axes I use image coordinates, so their range is the bounding box, which is clear. For the z axis, I assume the depth of each joint relative to the root joint lies within -1000 mm to 1000 mm, which is a reasonable assumption.

`joint_img[:, 2] = joint_img[:, 2] * cfg.depth_dim` rescales the depth value from 0~1 to 0~(cfg.depth_dim-1).
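
As a rough illustration of the depth processing described above (a minimal sketch, not the repository code; the `(z + 1) / 2` shift is my assumption based on the 0~1 range mentioned here, and `bbox_3d_shape`/`depth_dim` follow the config values discussed in this thread):

```python
import numpy as np

# Assumed config values, following the discussion above.
bbox_3d_shape = (2000, 2000, 2000)  # 3D bbox size in mm (z first)
depth_dim = 64                      # number of discretized depth bins

def normalize_root_relative_depth(z_rel_mm):
    """Map root-relative depth in mm to voxel depth units.

    z_rel_mm: depths relative to the root joint, assumed to lie
    within [-1000, 1000] mm (half of bbox_3d_shape[0] in each direction).
    """
    z = z_rel_mm / (bbox_3d_shape[0] / 2.)  # -> roughly [-1, 1]
    z = (z + 1.0) / 2.                      # -> roughly [0, 1]
    return z * depth_dim                    # -> roughly [0, depth_dim]

# Example: a knee 300 mm in front of the root and a head 500 mm behind it.
print(normalize_root_relative_depth(np.array([300.0, -500.0])))
```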

karenyun commented 4 years ago

Thanks for the quick reply~ I had thought it was regressing position values like the RootNet. So for the z axis you take the root as the center and assume the depth of the other joints relative to the root does not exceed 1000 mm? Then `cfg.depth_dim` represents the full 2000 mm range? Am I correct?

mks0601 commented 4 years ago

Yes. This is the PoseNet repo, and as you can read in the paper, the PoseNet only estimates the root joint-relative 3D pose. Its output can be combined with the output of the RootNet to obtain the absolute 3D pose.

karenyun commented 4 years ago

Thanks very much! From the code, I see that only the ground-truth depth z is root-relative; the ground-truth x and y coordinates are in image space rather than relative to the root.

But for the predicted x and y you use `pred_2d_kpt[:,0] = pred_2d_kpt[:,0] / cfg.output_shape[1] * bbox[2] + bbox[0]`; it seems you translate the predicted x and y into bbox space? Maybe I am missing something about why this translation is needed?

In addition, where do you combine the output of the RootNet with the output of the PoseNet? Is it done offline?

mks0601 commented 4 years ago

`pred_2d_kpt[:,0] = pred_2d_kpt[:,0] / cfg.output_shape[1] * bbox[2] + bbox[0]` resizes and translates the prediction from the cropped image space (bbox) back to the original image space.
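
For reference, a minimal sketch of that mapping for both axes (illustrative only; `output_shape` as (height, width) and `bbox` as (x_min, y_min, width, height) are my reading of the repo's conventions):

```python
import numpy as np

output_shape = (64, 64)  # heatmap resolution (height, width), assumed value

def heatmap_to_image_coords(pred_2d_kpt, bbox):
    """Map (x, y) predictions from heatmap space back to original image pixels.

    pred_2d_kpt: (num_joints, 2+) array whose first two columns are x, y
    in heatmap coordinates; bbox: (x_min, y_min, width, height) of the crop.
    """
    kpt = pred_2d_kpt.astype(np.float32).copy()
    kpt[:, 0] = kpt[:, 0] / output_shape[1] * bbox[2] + bbox[0]  # x: scale, then shift
    kpt[:, 1] = kpt[:, 1] / output_shape[0] * bbox[3] + bbox[1]  # y: scale, then shift
    return kpt
```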

Integrating the RootNet output is done in the evaluation code, e.g. lines 90, 129, and 174 of https://github.com/mks0601/3DMPPE_POSENET_RELEASE/blob/master/data/Human36M/Human36M.py
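
In case it helps, here is a minimal sketch of what that combination amounts to (not the repo's evaluation code; `pixel2cam`, the intrinsics `f`/`c`, and the variable names are my own illustration):

```python
import numpy as np

def pixel2cam(pixel_coord, f, c):
    """Back-project (x_img, y_img, Z_cam) points to camera coordinates.

    f: focal lengths (fx, fy); c: principal point (cx, cy).
    """
    x = (pixel_coord[:, 0] - c[0]) / f[0] * pixel_coord[:, 2]
    y = (pixel_coord[:, 1] - c[1]) / f[1] * pixel_coord[:, 2]
    z = pixel_coord[:, 2]
    return np.stack([x, y, z], axis=1)

def combine_with_rootnet(pred_kpt, root_depth, f, c):
    """Turn PoseNet output into an absolute 3D pose in camera space.

    pred_kpt: (num_joints, 3) with (x_img, y_img, z_rel), where z_rel is the
    root-relative depth in mm; root_depth: absolute root depth from RootNet.
    """
    kpt = pred_kpt.copy()
    kpt[:, 2] = kpt[:, 2] + root_depth  # relative depth -> absolute depth
    return pixel2cam(kpt, f, c)         # -> absolute 3D pose in camera space
```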

karenyun commented 4 years ago

Oh, sorry, I missed that part of the code. Thanks very much!

karenyun commented 4 years ago

Hi, sorry to bother you again. May I ask some other questions about 3D pose prediction that are not related to this paper and code?

I am confused about the root-relative ground truth of the 3D pose. In this paper, x and y stay in image space and only z is root-relative. But other papers, like "A simple yet effective baseline for 3d human pose estimation", zero-center the 3D pose around the hip joint. I am curious whether the x and y of the other joints are also relative to the root joint in camera space. If so, can Protocol #1 be computed directly without transforming the prediction to camera space?

I also have a question about the annotations of Human3.6M. There are 4 cameras capturing each person's action, so one time instant produces 4 different 2D images. Are the annotations of the same joint the same across those images, i.e. the same world-space position?

Could you give me some advice? Thanks very much!

mks0601 commented 4 years ago
  1. Yes, there are two types of 3D human pose methods: they estimate the root-relative 3D pose either as 1) (x_img, y_img, Z_cam) or 2) (X_cam, Y_cam, Z_cam). Usually, type 1) estimates (x_img, y_img) with heatmaps, which gives better performance because the heatmaps are image-aligned and do not require fully connected layers. (A small sketch contrasting the two representations follows this list.)

  2. Of course they have the same world coordinates, but different camera coordinates and image coordinates.
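
A small sketch of the two representations from 1) (illustrative only; the intrinsics and joint arrays are placeholders, not values from the datasets):

```python
import numpy as np

def root_relative_camera_space(joints_cam, root_idx=0):
    """Type 2): zero-center the camera-space pose around the root (hip) joint."""
    return joints_cam - joints_cam[root_idx]

def image_space_with_relative_depth(joints_cam, f, c, root_idx=0):
    """Type 1): (x_img, y_img, Z_cam - Z_root), the representation used here."""
    x_img = joints_cam[:, 0] / joints_cam[:, 2] * f[0] + c[0]
    y_img = joints_cam[:, 1] / joints_cam[:, 2] * f[1] + c[1]
    z_rel = joints_cam[:, 2] - joints_cam[root_idx, 2]
    return np.stack([x_img, y_img, z_rel], axis=1)
```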

karenyun commented 4 years ago

Many thanks for solving my confusion!!!