skmhrk1209 / VSRD

The official implementation of "VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection" [CVPR 2024]
https://arxiv.org/abs/2404.00149
MIT License

Two questions about the approach #7

Closed johannes-tum closed 5 days ago

johannes-tum commented 2 weeks ago

Thanks a lot for your great paper and approach. I have two questions:

johannes-tum commented 2 weeks ago

And one more question: When you are speaking of multi-view images:

skmhrk1209 commented 1 week ago

Apart from Table 5 you are performing all experiments on KITTI-360. What is the reason for that? Does KITTI-360 provide something that KITTI lacks? E.g., masks?

Yes. The KITTI dataset does not provide masks. Moreover, the KITTI dataset provides camera poses, but they are not accurate since they are raw OXTS (GPS/IMU) measurements. The KITTI-360 dataset provides more accurate camera poses optimized by SfM.

You are also using 2D bounding boxes for supervision. The original KITTI dataset provides tight 2D bboxes. However, in the domain of weakly supervised monocular 3D object detection I think it is common to use the reprojected 3D edges and reconstruct a bounding box from that. Which one did you use?

We used tight 2D bounding boxes, which can be obtained from instance masks. [code] As you know, the 2D bounding box reprojected from a 3D bounding box does not tightly fit the boundary of the object in the image plane. However, it does not matter, since we also use the proposed multi-view silhouette loss via volume rendering.
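To illustrate the point about obtaining tight boxes from instance masks, here is a minimal sketch in plain Python. It is not the repository's code: the function name `mask_to_tight_bbox`, the nested-list mask layout, and the inclusive pixel convention are all assumptions made for this example.

```python
def mask_to_tight_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the foreground pixels.

    `mask` is a 2D list of rows (hypothetical layout); any truthy value
    counts as foreground. Coordinates are inclusive pixel indices.
    Returns None when the mask contains no foreground pixel.
    """
    # Rows that contain at least one foreground pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    # Column indices of every foreground pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return (min(xs), ys[0], max(xs), ys[-1])


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_tight_bbox(mask))  # (1, 1, 3, 2)
```

A box computed this way touches the object's outermost pixels on all four sides, which is what "tight" means here, whereas a box reprojected from a 3D cuboid generally extends past the silhouette.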

And one more question: When you are speaking of multi-view images:

We used temporally adjacent images, so stereo cameras are not required. But since the KITTI-360 dataset provides stereo images, we can also use them.
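As a rough illustration of what "temporally adjacent images" could look like in practice, the sketch below picks the frames around a target frame of a monocular sequence. The function name and the `num_neighbors` and `stride` parameters are hypothetical and chosen only for this example; the actual sampling scheme used in the repository may differ.

```python
def adjacent_frame_indices(target, num_frames, num_neighbors=2, stride=1):
    """Indices of frames around `target`, clipped to the sequence bounds.

    All parameters are illustrative assumptions:
    `num_neighbors` frames are taken on each side, `stride` frames apart.
    The target frame itself is included in the result.
    """
    offsets = range(-num_neighbors * stride, num_neighbors * stride + 1, stride)
    return [target + o for o in offsets if 0 <= target + o < num_frames]


print(adjacent_frame_indices(5, 100))            # [3, 4, 5, 6, 7]
print(adjacent_frame_indices(0, 100))            # [0, 1, 2]
print(adjacent_frame_indices(5, 100, stride=2))  # [1, 3, 5, 7, 9]
```

Each selected frame would then be paired with its own camera pose, so the multi-view supervision comes from the vehicle's motion over time and not from a second camera.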