mks0601 / 3DMPPE_ROOTNET_RELEASE

Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019
MIT License

How to match the estimated bbox with the ground-truth 2D pose? #24

Closed zhangyahu1 closed 3 years ago

zhangyahu1 commented 3 years ago

Thanks for sharing your nice work!

I have a naive question about how to match the estimated bbox with the ground-truth 2D pose. Since the bounding boxes are obtained from Mask R-CNN without fine-tuning, their order may differ from that of the ground truth. I would appreciate it if you could give me any hints.

mks0601 commented 3 years ago

Could you give me more details? Why are you trying to match the estimated bbox with the GT 2D poses?

zhangyahu1 commented 3 years ago

Because I want to generate a 2D heatmap for each person in the cropped images, based on the bounding box and the GT 2D pose.

zhangyahu1 commented 3 years ago

For example, if there are two persons (P1 and P2) in the input image, the estimated bboxes are (bbox_P1, bbox_P2). I then plan to generate the 2D heatmap using bbox_P1 and the matched GT 2D pose P2d_GT_P1. In this way, the task of multi-person pose estimation is separated into several single-person pose estimation tasks.

mks0601 commented 3 years ago

There are several heuristic ways. For example, you can build a bbox from each GT 2D pose by extending the min/max coordinates of its joints. Then calculate the IoU with each estimated bbox from Mask R-CNN and choose the box with the largest IoU.
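The heuristic above could look like this. All function names here are hypothetical (a minimal sketch, not the repo's code), and the margin/threshold values are assumptions:

```python
import numpy as np

def bbox_from_joints(joints_2d, margin=0.1):
    # Build a bbox by extending the min/max joint coordinates by a margin.
    x_min, y_min = joints_2d.min(axis=0)
    x_max, y_max = joints_2d.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    return np.array([x_min - margin * w, y_min - margin * h,
                     x_max + margin * w, y_max + margin * h])

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_detections(gt_joints_list, det_bboxes, iou_thresh=0.5):
    # For each GT pose, pick the detection with the largest IoU;
    # None if no detection overlaps enough.
    matches = []
    for joints in gt_joints_list:
        gt_box = bbox_from_joints(joints)
        ious = [iou(gt_box, det) for det in det_bboxes]
        best = int(np.argmax(ious))
        matches.append(best if ious[best] >= iou_thresh else None)
    return matches
```

A greedy variant that removes each matched detection from the pool would avoid two GT poses claiming the same box.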

zhangyahu1 commented 3 years ago

Thanks for your answer! I will try to implement it. By the way, did you also use this strategy in your experiments on the MuCo-3DHP dataset? I notice that you also estimate the bbox first.

mks0601 commented 3 years ago

No, I don't use the GT 2D bbox during inference.

zhangyahu1 commented 3 years ago

If I understand correctly now, you only need to estimate the 2D pose on the cropped image (based on the estimated bbox) and then transform the estimated 2D pose from the cropped image back to the original image space. In this way, you do not need to match the estimated bbox with the ground-truth bbox/2D pose/3D pose.
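The crop-to-image transform mentioned here could be sketched as below. This is a hypothetical helper that assumes the crop was a plain resize of the bbox region (the actual repo uses an affine patch transform, which is more involved):

```python
import numpy as np

def crop_to_image_coords(pose_crop, bbox, crop_size=(256, 256)):
    # Map 2D joints predicted in a fixed-size crop back to the original
    # image space, assuming the crop simply resized the bbox region.
    x1, y1, x2, y2 = bbox
    scale_x = (x2 - x1) / crop_size[0]
    scale_y = (y2 - y1) / crop_size[1]
    pose_img = np.asarray(pose_crop, dtype=float).copy()
    pose_img[:, 0] = pose_img[:, 0] * scale_x + x1
    pose_img[:, 1] = pose_img[:, 1] * scale_y + y1
    return pose_img
```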

mks0601 commented 3 years ago

I could do that, but it requires an additional pre-processing stage for the 2D pose.

zhangyahu1 commented 3 years ago

Sorry, I did not express it clearly. You first estimate bboxes on the MuCo-3DHP dataset. Then RootNet and PoseNet are trained on the cropped images and the corresponding GT 2D/3D poses. My question is: in your implementation, how do you know which bbox in an image corresponds to which GT 2D pose in a multi-person dataset? Maybe it is a naive question, but it confuses me now.

mks0601 commented 3 years ago

Ah, I understand your question. During the training stage, I use the GT bbox. During the testing stage, the MuPoTS-3D evaluation matches each bbox to the GT using some heuristics (I'm not an author of that dataset, so I don't know exactly, but it seems to pick the detection result that has the minimum pose distance to the GT).
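The minimum-pose-distance heuristic mentioned here might be sketched as follows. This is a guess at the idea, not the MuPoTS-3D evaluation code; `match_by_pose_distance` is a hypothetical helper:

```python
import numpy as np

def match_by_pose_distance(pred_poses, gt_poses):
    # For each GT pose, pick the prediction with the smallest mean
    # per-joint Euclidean distance. Works for (J, 2) or (J, 3) arrays.
    matches = []
    for gt in gt_poses:
        dists = [np.linalg.norm(pred - gt, axis=1).mean()
                 for pred in pred_poses]
        matches.append(int(np.argmin(dists)))
    return matches
```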

zhangyahu1 commented 3 years ago

Thanks! I think the advice you gave yesterday will be helpful.