mks0601 / 3DMPPE_POSENET_RELEASE

Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019
MIT License

A question about MUPOTS dataset #12

Closed · gh18l closed this issue 5 years ago

gh18l commented 5 years ago

I downloaded the annotation file for the MUPOTS dataset from your link. The 2D coordinates are in keypoints_img, the x and y of the 3D annotation are the same as keypoints_img, and z comes from the z of keypoints_cam. But I found that z does not match the real depth positions in the corresponding image. Are there any extra steps to perform?

mks0601 commented 5 years ago

What do you mean by misplaced? keypoints_img contains 2D image coordinates, and keypoints_cam contains camera-centered 3D coordinates.
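
In code, the two arrays are related by the standard pinhole projection. A minimal NumPy sketch (the helper name cam2pixel mirrors this repo's pose utilities, but treat the exact signature as an assumption; f = (fx, fy) and c = (cx, cy) come from the dataset's camera parameters):

```python
import numpy as np

def cam2pixel(cam_coord, f, c):
    # Project camera-centered 3D joints (J, 3) into image space:
    # divide by depth, scale by focal length, shift by principal point.
    x = cam_coord[:, 0] / cam_coord[:, 2] * f[0] + c[0]
    y = cam_coord[:, 1] / cam_coord[:, 2] * f[1] + c[1]
    z = cam_coord[:, 2]  # depth is carried through unchanged
    return np.stack([x, y, z], axis=1)
```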

gh18l commented 5 years ago

@mks0601 Thank you for your reply. For example, keypoints_img is (x1, y1) and keypoints_cam is (x2, y2, z2), so the image-centered 3D coordinates (the camera-centered point pushed through the intrinsic matrix) are (x1, y1, z2). I picked a sample (TS3-000000.jpg) and found that the three persons in the image clearly stand at different depths, yet the image-centered z2 of their pelvis joints is almost the same.

mks0601 commented 5 years ago

The image projection (x2, y2, z2) -> (x1, y1) is affected by z2. Could you check that the projected (x1, y1) gives a valid 2D pose on the image? If so, there is no problem. I didn't modify the provided original dataset; I just converted it to MS COCO-style .json format.
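
A quick way to run that check, assuming the annotation .json and the per-sequence intrinsics are loaded (the variable names here are hypothetical):

```python
import numpy as np

def check_projection(kp_cam, kp_img, f, c):
    # kp_cam: (J, 3) from keypoints_cam; kp_img: (J, 2) from keypoints_img;
    # f = (fx, fy), c = (cx, cy): the sequence's camera intrinsics.
    u = kp_cam[:, 0] / kp_cam[:, 2] * f[0] + c[0]
    v = kp_cam[:, 1] / kp_cam[:, 2] * f[1] + c[1]
    err = np.abs(np.stack([u, v], axis=1) - kp_img).max()
    print('max reprojection error (px):', err)  # ~0 means the annotations agree
```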

gh18l commented 5 years ago

@mks0601 Thank you. The number of persons per image in MUPOTS is not fixed; I had overlooked that.

gh18l commented 5 years ago

Can I ask you another question? If the size and position of a person were exactly the same in a MUCO image and a MUPOTS image, then x1 and y1 of the image-centered joint (x1, y1, z2) would obviously be equal across the two datasets, but would z2 be equal as well? In other words, can I train on MUCO with image-centered points (x1, y1, z2) (using the absolute depth z2, without subtracting the root depth) and then evaluate on MUPOTS with the same representation?

mks0601 commented 5 years ago

I cannot fully understand your question, but let me answer based on my best guess.

  1. If the size and position of a person are the same in images from different datasets, will the image-centered joint coordinates (x, y) be the same? Answer: No. They also depend on the image resolution and the poses the humans are performing.

  2. If the size and position of a person are the same in images from different datasets, will the distance between the camera and each joint be the same? Answer: No. It also depends on the focal length and the poses the humans are performing (see the numeric sketch below).
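
To see the focal-length effect concretely, project the same camera-centered joint with two hypothetical focal lengths (all numbers below are made up):

```python
# Same camera-centered joint, two hypothetical focal lengths (units: mm, px).
X, Y, Z = 300.0, -200.0, 3000.0     # camera-centered 3D position
for fx in (1000.0, 1500.0):
    u = fx * X / Z + 960.0          # assumed principal point cx = 960
    print(fx, '->', u)              # 1060.0 vs 1110.0: different pixel columns
```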

gh18l commented 5 years ago

Thank you for your reply. Suppose I have the image-centered and camera-centered joints that your links provide for the MUCO and MUPOTS datasets; can you tell me how to evaluate on MUPOTS while training on MUCO (input: 2D joints -> output: 3D joints)?

I have tried using the image-centered joints (x, y, z) directly (training input: 2D image-centered joints (x, y) -> training output: 3D image-centered joints (x, y, z); evaluation works the same way, so the network effectively only predicts z), but it seems wrong. The 2D inputs were normalized into [-1, 1] with (x, y) / img_width * 2 - (1.0, img_height / img_width) to eliminate the influence of different resolutions.
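
For concreteness, that normalization (my own convention, not code from this repo) looks like:

```python
import numpy as np

def normalize_2d(kp_xy, img_width, img_height):
    # Map pixel coordinates (J, 2) so that x lands in [-1, 1] and
    # y lands in [-h/w, h/w], preserving the aspect ratio.
    offset = np.array([1.0, img_height / img_width])
    return kp_xy / img_width * 2.0 - offset
```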

Thank you!

gh18l commented 5 years ago

Sorry for my ambiguous question...I just want to ask:

  1. What is the difference between "univ_annot3" and "annot3" in the MUCO and MUPOTS datasets?
  2. How should I process and use the MUCO and MUPOTS datasets if I want to train on MUCO and evaluate on MUPOTS? (Note: the input of my system is 2D joints; the output is 3D joints.)

Thank you very much!

mks0601 commented 5 years ago

  1. I'm not the author of the MuCo paper, so I don't know exactly. I asked the authors but got no reply. I just used annot3 as the camera-centered coordinates.
  2. What do you mean by process?

gh18l commented 5 years ago

@mks0601 I mean that the two datasets have some gaps, such as different intrinsic matrices. What should my system's input and output be? annot3[:2] as input and annot3 as output, image coordinates as input and annot3 as output, or something with extra processing?

Note: I have tried the former two modes, but neither seems to work.

mks0601 commented 5 years ago

Most 2D-3D pose lifting methods use 2D image coordinates as input and 3D root-relative camera-centered coordinates as the ground truth (target). For the details, you'd better read a multi-view geometry textbook for computer vision.
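
A minimal sketch of that input/target construction (the pelvis index 14 matches this repo's 21-joint MuCo ordering as far as I know, but verify it for your data):

```python
import numpy as np

ROOT_IDX = 14  # pelvis; assumed joint index, check your joint ordering

def make_lifting_pair(kp_img, kp_cam, root_idx=ROOT_IDX):
    # Input: 2D image coordinates (J, 2).
    # Target: root-relative camera-centered 3D coordinates (J, 3),
    # i.e. every joint expressed relative to the pelvis.
    inp = kp_img[:, :2].astype(np.float32)
    target = (kp_cam - kp_cam[root_idx]).astype(np.float32)
    return inp, target
```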

gh18l commented 5 years ago

@mks0601 But if the intrinsic matrices are different, the same 2D image point can correspond to different 3D points, so the same network would have to produce different outputs for identical inputs.

mks0601 commented 5 years ago

Let's say there is a person in 3D space and we capture him with two different cameras (different intrinsic matrices, same extrinsic matrix). Since the extrinsic matrices are the same, the camera-centered 3D coordinates of his keypoints are identical. Since the intrinsic matrices differ, the 2D poses in image space differ (mainly in scale). However, you can crop the 2D pose using its bounding box, after which the marginal difference can be ignored.
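
One way to realize that crop-based normalization (a sketch under my own conventions, not code from this repo):

```python
import numpy as np

def bbox_normalize_2d(kp_xy):
    # Normalize a 2D pose (J, 2) by its own bounding box, so poses of the
    # same person captured with different intrinsics nearly coincide.
    mins, maxs = kp_xy.min(axis=0), kp_xy.max(axis=0)
    center = (mins + maxs) / 2.0
    scale = (maxs - mins).max()  # longest bounding-box side
    return (kp_xy - center) / scale
```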

gh18l commented 5 years ago

@mks0601 Thank you, I will try it.