mks0601 / 3DMPPE_ROOTNET_RELEASE

Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019
MIT License

Testing without access to camera intrinsic parameters #19

Closed henryyuanheng-wang closed 4 years ago

henryyuanheng-wang commented 4 years ago

In the paper, you mention that RootNet can be used at test time even without fx and fy. Does that mean RootNet can still give a good depth estimate with fake camera intrinsics (in the pipeline you built, fx, fy, cx, and cy are still required at test time)? Is this because the intrinsics are somehow learned during training and then carry over to unseen images?

I know you are busy with other stuff, so thanks for your reply in advance when you have time!

mks0601 commented 4 years ago

Hi, sorry for the late reply.

RootNet learns nothing specific to the camera intrinsics. The input and ground truth exist in an intrinsic-normalized space. If I fed in some fake intrinsics, the output would be scaled according to those intrinsics. 'Scale' here means the scale of the human. I did not consider the human scale (for example, a typical human height of about 1.7 m) when feeding fake intrinsics to RootNet at test time. However, you can feed in reasonable intrinsics that make the output 3D keypoints have a height of about 1.7 m.
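
To make the scaling concrete, here is a minimal sketch. It assumes the distance measure described in the paper, k = sqrt(fx * fy * A_real / A_img), with A_real taken as 2000 mm x 2000 mm, and it assumes the network contributes only an intrinsics-independent correction to k. The function names and all numeric values are hypothetical and are not taken from the repository.

```python
import math

# Assumed real-world area (mm^2) that a person's bounding box is taken to cover.
A_REAL_MM2 = 2000.0 * 2000.0


def distance_measure_k(fx, fy, bbox_w, bbox_h, a_real=A_REAL_MM2):
    """Return k = sqrt(fx * fy * A_real / A_img) in millimetres."""
    a_img = bbox_w * bbox_h  # bounding-box area in pixels^2
    return math.sqrt(fx * fy * a_real / a_img)


def root_depth(correction, fx, fy, bbox_w, bbox_h):
    """Depth as k times a correction that does not depend on the intrinsics.

    'correction' stands in for whatever the network contributes; how it is
    applied exactly is an assumption of this sketch.
    """
    return correction * distance_measure_k(fx, fy, bbox_w, bbox_h)


# Same crop and same network output, two different focal-length guesses.
correction = 1.1
depth_guess = root_depth(correction, 1500.0, 1500.0, 200.0, 400.0)
depth_doubled = root_depth(correction, 3000.0, 3000.0, 200.0, 400.0)

# Doubling both focal lengths doubles the predicted depth.
print(depth_doubled / depth_guess)
```

Under these assumptions the predicted depth is proportional to sqrt(fx * fy), so a wrong focal-length guess rescales every root depth by the same factor without changing the network's output.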

henryyuanheng-wang commented 4 years ago

Thanks for the reply! After a closer look at your implementation, I did realize that the intrinsic values need to be provided.

One observation: for images where people are far away, the returned root coordinates can have a large y coordinate (a large height value, suggesting the people might be in the air?). I wonder whether this happens because the training set lacks far-away people, leaving the model unable to adjust the heights of distant people. In other words, the model has no prior knowledge that people all stay on the ground.

Does this sound like a reasonable assumption to you?

mks0601 commented 4 years ago

The input to RootNet is a cropped and resized human image. The y coordinate of the root joint is defined in 2D image space, and the depth of the root joint is defined in camera-centered 3D space. Since estimating the 2D y coordinate of the root joint is highly robust, I think people would not be in the air and would stay on the ground.
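
The relationship between the 2D image coordinates, the depth, and the camera-centered 3D position can be sketched with the standard pinhole model. This is an illustrative sketch only: the function name and the intrinsics below are hypothetical values, not the repository's code or any dataset's calibration.

```python
def pixel_to_camera(x_px, y_px, depth_mm, fx, fy, cx, cy):
    """Back-project a pixel plus depth to camera-centered coordinates (mm).

    Standard pinhole model: X = (x - cx) * Z / fx, Y = (y - cy) * Z / fy.
    The image y-axis points down, so pixels above the principal point
    produce a negative Y.
    """
    x_cam = (x_px - cx) * depth_mm / fx
    y_cam = (y_px - cy) * depth_mm / fy
    return x_cam, y_cam, depth_mm


# Hypothetical intrinsics for a 1920x1080 image.
fx = fy = 1500.0
cx, cy = 960.0, 540.0

# The same pixel (200 px above the principal point) at two different depths.
near = pixel_to_camera(960.0, 340.0, 5000.0, fx, fy, cx, cy)
far = pixel_to_camera(960.0, 340.0, 10000.0, fx, fy, cx, cy)

# The camera-space Y offset grows in proportion to depth.
print(near[1], far[1])
```

A pixel offset from the principal point therefore maps to a larger camera-space offset the farther the person is, even when the 2D estimate itself is accurate.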

henryyuanheng-wang commented 4 years ago

I would love to get your opinion on this example. This is a photo from MuPoTS. Notice the two people far away in the background. I plotted the predicted roots in 3D space (the top of each line indicates the root). The y-coordinate (vertical coordinate) for purple and orange is higher than for the people in the foreground.

I have some other examples where this is the case, especially if the camera is overlooking the scene (for instance, surveillance camera angle). Since the coordinates are all camera-centered, I figured this makes sense? What do you think?

[Two screenshots attached: the MuPoTS photo, and below it the 3D plot of the predicted roots]

mks0601 commented 4 years ago

Sorry, I cannot understand the figure. In the bottom figure, there are five lines (blue, red, green, purple, and orange). Could you tell me which person each line belongs to?

henryyuanheng-wang commented 4 years ago

Yep. The blue, red, and green lines represent the girl and the two guys in the foreground. Purple represents the guy walking by with the backpack in the background, and orange is the guy sitting there.

Thanks!

mks0601 commented 4 years ago

This is an interesting example, and I agree that this makes sense because the 3D coordinates are camera-centered. The floor plane would be like the black plane in the bottom figure (sorry for my terrible drawing skills :( ). The black plane would be rotated onto the xz plane after converting the camera-centered coordinates to world coordinates.

[Image attached: the 3D plot with the floor plane drawn in black]
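
The effect discussed above can be reproduced with a small sketch, assuming a camera that is pitched downward by a known angle (as with a surveillance view). The pitch, the root height, and the function names are hypothetical; RootNet does not estimate the camera pitch. Axes follow the usual camera convention: x right, y down, z forward.

```python
import math


def world_to_camera(point, pitch):
    """Express a world point in the frame of a camera pitched down by 'pitch'.

    The world frame shares the camera origin, with y pointing straight down
    and z pointing horizontally forward.
    """
    x, y, z = point
    c, s = math.cos(pitch), math.sin(pitch)
    return (x, y * c - z * s, y * s + z * c)


def camera_to_world(point, pitch):
    """Inverse rotation: undo the camera pitch."""
    x, y, z = point
    c, s = math.cos(pitch), math.sin(pitch)
    return (x, y * c + z * s, -y * s + z * c)


pitch = math.radians(20.0)   # hypothetical downward tilt of the camera
root_below_camera = 600.0    # hypothetical: roots sit 600 mm below the camera

# Two people standing on the same floor, one near and one far.
near_cam = world_to_camera((0.0, root_below_camera, 3000.0), pitch)
far_cam = world_to_camera((0.0, root_below_camera, 12000.0), pitch)

# In camera coordinates the far root has a smaller y (it appears higher up).
print(near_cam[1], far_cam[1])

# After undoing the pitch, both roots are back at the same height.
print(camera_to_world(near_cam, pitch)[1], camera_to_world(far_cam, pitch)[1])
```

In this sketch the far person looks elevated only in the camera frame; once the tilt is removed, the floor becomes a plane of constant y, which matches the explanation above.
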
henryyuanheng-wang commented 4 years ago

Agreed. Thanks for the opinion. Maybe collecting more data with far-away people in the training set can mitigate the problem, but collecting groundtruth for those can be hard. Looking forward to future solutions. Hope you are staying healthy and safe, and good luck on your research!