mks0601 / 3DMPPE_ROOTNET_RELEASE

Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019
MIT License

Combination Strategy for Root + Pose #18

Closed: usamahjundia closed this issue 4 years ago

usamahjundia commented 4 years ago

Hello! Thanks in advance for this implementation of a cool paper! Currently I am trying to combine RootNet and PoseNet into a single pipeline for 3D inference. What I want to ask is: what is the combination strategy for the two networks? What do we do with both pieces of information? How do we extract the crop that is input to RootNet and PoseNet? What if there is a problem with occlusion, etc.? Given that the paper uses Mask R-CNN to first extract individual humans, do we mask out the unnecessary regions from a single crop?

Then what do we do with the PoseNet and RootNet predictions? Do we merely add them together? Judging from the results in the paper, the x, y, z produced by PoseNet and RootNet are in pixels, right?

Also, RootNet requires a pre-computed K value; how did you get this value?

Thanks!

mks0601 commented 4 years ago

There are too many questions, and the answers to most of them are actually in my paper. Could you reduce the number of questions after reading the paper?

usamahjundia commented 4 years ago

No, I do understand the gist of some of the questions (which is explained pretty clearly in the paper, thanks for that), but not the technical aspects of doing it. Here are the reduced questions:

  1. Did you just crop a bounding box out of the detection (the paper mentions this), or do you crop the bounding box and also mask the image (this remains unclear)? Does not doing the latter still work well when two persons occlude one another?
  2. What is the dimension of RootNet's correction factor? The computed K is in a real-world length unit (mm, cm, inch, etc.), but what I observe in the testing protocol is that you directly compute the distance against the GT "depth" (distance from the camera). Yes, it undergoes some transformations, but inspecting the code, I see that the Z value remains unchanged. The short question is: RootNet outputs x, y, and Z. Are x and y in pixels and Z in a length unit?
  3. If (2) is true, how do we convert Z back to pixels, or whatever unit is suitable as input? If I know the dimensions of the image sensor and the actual pixel range of the image, can I use this information to do so?

Thank you, and I apologize for asking so many questions. I wrote them while reading the paper, which takes me several passes, and, as you can see, I edited the questions several times as I found answers on each pass.

mks0601 commented 4 years ago
  1. I am not sure what you mean by 'masking', but I really just 'crop and resize' the bounding box area from the original image.
  2. The dimension of the correction factor is 1 because it is a scalar. I think you are actually asking about the unit, which is described in the paper. RootNet outputs the (x, y) image coordinates of the human root joint; the correction factor gamma has no unit, and K is in mm.
  3. What do you mean by converting Z back to pixels? Pixels are defined in xy space, not z space.
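
For reference, a minimal sketch of how the pre-computed K (in mm) and the correction factor gamma could fit together, following the formula in the paper; the focal lengths, bounding box, and gamma below are placeholder values, and the 2000 mm x 2000 mm real-area constant is the one from the paper:

import math

focal = (1500.0, 1500.0)            # assumed focal lengths fx, fy in pixels
bbox = (100.0, 50.0, 200.0, 400.0)  # assumed detector output (x_min, y_min, width, height) in pixels
real_area = 2000.0 * 2000.0         # real-world bbox area in mm^2, constant from the paper

# distance prior K (mm): sqrt(fx * fy * A_real / A_img)
img_area = bbox[2] * bbox[3]
k_value = math.sqrt(focal[0] * focal[1] * real_area / img_area)

# RootNet predicts a unitless correction factor gamma; the absolute root depth (mm)
# is then gamma * K
gamma = 1.1  # example network output
abs_root_depth = gamma * k_value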
usamahjundia commented 4 years ago

What I meant by the third point is representing the depth/distance (the corrected K value) in pixel dimensions. But it seems I found some insight in the code. I'm looking at these lines in MSCOCO.py:

pred_2d_kpt = np.take(pred_2d_kpt, self.eval_joint, axis=0)  # keep only the evaluation joints
pred_2d_kpt[:,0] = pred_2d_kpt[:,0] / cfg.output_shape[1] * bbox[2] + bbox[0]  # heatmap x -> image x (pixels)
pred_2d_kpt[:,1] = pred_2d_kpt[:,1] / cfg.output_shape[0] * bbox[3] + bbox[1]  # heatmap y -> image y (pixels)
pred_2d_kpt[:,2] = (pred_2d_kpt[:,2] / cfg.depth_dim * 2 - 1) * (cfg.bbox_3d_shape[0]/2) + gt_3d_root[2]  # depth bin -> root-relative mm, plus root depth

It seems PoseNet outputs x, y, z all in pixel units, and this is the conversion into length units.

So is this statement true:

In 3D pixel coordinates, a human is surrounded by a 3D bounding box of size 256x256x256 (or 64?), which in real-world coordinates is 2x2x2 meters, and the processing of the third dimension of the PoseNet output is the mapping from this 256^3 pixel space to the 2^3 meter space?

(Yes, I am aware this isn't the repo for PoseNet, but starting a new issue in the other repo just to get a related question answered does not seem effective.)

mks0601 commented 4 years ago

First of all, depth cannot be converted to pixels. Depth is defined along the z-axis, and pixels are defined along the x- and y-axes. PoseNet outputs x, y in image coordinate space and z in a discretized camera-centered space. Regarding your question, there is no '3D pixel coordinate'. Do you mean voxel? If that is the case, the answer is close to yes. I cannot understand 'processing on the third dimension'.
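
To make the combination concrete, here is one possible sketch, mirroring the evaluation code quoted above but substituting RootNet's predicted absolute root depth for the GT root; output_shape, depth_dim, and bbox_3d_shape stand in for the corresponding config values, and the exact pipeline in the repo may differ:

import numpy as np

output_shape = (64, 64)             # heatmap (height, width), cf. cfg.output_shape
depth_dim = 64                      # number of depth bins, cf. cfg.depth_dim
bbox_3d_shape = (2000, 2000, 2000)  # real-size 3D bbox in mm, cf. cfg.bbox_3d_shape

def combine(pose_3d, bbox, root_depth):
    # pose_3d: (J, 3) PoseNet output, x/y in heatmap coords, z in depth bins
    # bbox: (x_min, y_min, width, height) of the person crop in the original image
    # root_depth: absolute root depth in mm predicted by RootNet
    out = pose_3d.astype(np.float32).copy()
    out[:, 0] = out[:, 0] / output_shape[1] * bbox[2] + bbox[0]  # heatmap x -> image x
    out[:, 1] = out[:, 1] / output_shape[0] * bbox[3] + bbox[1]  # heatmap y -> image y
    out[:, 2] = (out[:, 2] / depth_dim * 2 - 1) * (bbox_3d_shape[0] / 2) + root_depth  # bins -> mm
    return out  # (x_img, y_img, z_cam) per joint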

usamahjundia commented 4 years ago

By 'processing on the third dimension' I meant this line:

pred_2d_kpt[:,2] = (pred_2d_kpt[:,2] / cfg.depth_dim * 2 - 1) * (cfg.bbox_3d_shape[0]/2) + gt_3d_root[2]

And okay, I obviously failed at framing it in academic terms, so let me reframe: say I want to visualize or export the 3D coordinates of a pose, for example to animate a 3D character model. How do I make sure the x, y, and z coordinates are consistent, i.e., that movements along the x and y axes (which in the PoseNet output are in image space, the term I was failing to find) are in the same scale/unit as the z axis?

mks0601 commented 4 years ago

L168 of https://github.com/mks0601/3DMPPE_POSENET_RELEASE/blob/master/data/MSCOCO/MSCOCO.py

pixel2cam converts (x_img, y_img, z_cam) to (x_cam, y_cam, z_cam)
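
For completeness, a sketch of the standard pinhole back-projection that pixel2cam performs; f = (fx, fy) and c = (cx, cy) are the focal lengths and principal point in pixels, and the signature here may not match the repo exactly:

import numpy as np

def pixel2cam(pixel_coord, f, c):
    # pixel_coord: (J, 3) array of (x_img, y_img, z_cam), with z_cam in mm
    x = (pixel_coord[:, 0] - c[0]) / f[0] * pixel_coord[:, 2]
    y = (pixel_coord[:, 1] - c[1]) / f[1] * pixel_coord[:, 2]
    z = pixel_coord[:, 2]
    return np.stack((x, y, z), axis=1)  # (x_cam, y_cam, z_cam) in mm

With this, movements along x, y, and z all end up in the same unit (mm) in camera-centered coordinates, which is what you would want for animating a 3D character.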