Open wusize opened 3 years ago
Would you please give more details about your 2D model? Is it a top-down approach? If so, are the bounding boxes ground truth or detected?
I have the same question. In particular, Table 3 compares this paper's speed with VoxelPose without counting the 2D pose estimation time, even though VoxelPose takes RGB images as input and outputs 3D poses. That kind of comparison is unfair.
@wusize
Yes, it is top-down. For Campus and Shelf, we use the detected 2D poses provided by VoxelPose. For CMU Panoptic, we use the ground-truth bounding boxes and detect 2D poses with Mask R-CNN.
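For readers unfamiliar with the top-down pattern described above, here is a minimal sketch: crop each person box (ground truth or detected), run a single-person 2D pose network on the crop, and map the heatmap peaks back to image coordinates. The `pose_net` callable and all sizes below are illustrative placeholders, not the actual Mask R-CNN pipeline used in the paper.

```python
import numpy as np
import cv2

def topdown_2d_poses(image, boxes, pose_net, input_hw=(256, 192)):
    """Top-down 2D pose estimation: crop each person box (ground truth or
    detected), resize to a fixed input, run a single-person pose network,
    then map heatmap argmaxes back to original image coordinates."""
    H_in, W_in = input_hw
    all_poses = []
    for (x1, y1, x2, y2) in boxes:
        crop = image[int(y1):int(y2), int(x1):int(x2)]
        inp = cv2.resize(crop, (W_in, H_in))
        heatmaps = pose_net(inp)  # (J, H_hm, W_hm), one heatmap per joint
        J, H_hm, W_hm = heatmaps.shape
        joints = []
        for j in range(J):
            v, u = np.unravel_index(heatmaps[j].argmax(), (H_hm, W_hm))
            # heatmap coords -> crop coords -> full-image coords
            joints.append((x1 + u / W_hm * (x2 - x1),
                           y1 + v / H_hm * (y2 - y1)))
        all_poses.append(joints)
    return all_poses

# Stand-in for a real single-person pose backbone (e.g., a keypoint head or HRNet)
dummy_net = lambda img: np.random.rand(17, 64, 48)
img = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(topdown_2d_poses(img, [(100, 100, 300, 500)], dummy_net)[0][:2])
```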
@fandulu
VoxelPose also supports taking 2D pose detections as input to fill in the heatmaps. For a fair comparison, we measure running speed against VoxelPose on that part of the framework only, excluding the 2D pose detection time for both methods.
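For context, a minimal sketch of what "filling in the heatmaps" from 2D detections can look like: each detected joint is rendered as a Gaussian blob weighted by its confidence, and blobs from multiple people are merged per joint. Function names and parameters here are illustrative assumptions, not VoxelPose's actual code.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, scores, hm_size=(128, 240), sigma=3.0):
    """Render per-joint Gaussian heatmaps from detected 2D keypoints.

    keypoints: (P, J, 2) array of (x, y) in heatmap coordinates
    scores:    (P, J) per-joint detection confidences
    Returns:   (J, H, W) heatmaps, max-merged over the P detected persons.
    """
    H, W = hm_size
    P, J, _ = keypoints.shape
    heatmaps = np.zeros((J, H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for p in range(P):
        for j in range(J):
            x, y = keypoints[p, j]
            g = scores[p, j] * np.exp(-((xs - x) ** 2 + (ys - y) ** 2)
                                      / (2 * sigma ** 2))
            heatmaps[j] = np.maximum(heatmaps[j], g)  # merge persons per joint
    return heatmaps

# Example: 2 persons, 17 COCO joints with random positions and confidences
kps = np.random.rand(2, 17, 2) * [240, 128]
conf = np.random.rand(2, 17)
print(keypoints_to_heatmaps(kps, conf).shape)  # (17, 128, 240)
```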
Thanks for your reply; that resolves my question.
Hi authors, the project only provides the detection results for the validation frames; do you have the full detection results? I used HRNet to detect keypoints, and my detections are noticeably worse than yours.