Two questions about the approach

johannes-tum commented 2 weeks ago

Thanks a lot for your great paper and approach. I have two questions:

Apart from Table 5 you are performing all experiments on KITTI-360. What is the reason for that? Does KITTI-360 provide something that KITTI lacks? E.g., masks?
You are also using 2D bounding boxes for supervision. The original KITTI dataset provides tight 2D bboxes. However, in the domain of weakly supervised monocular 3D object detection I think it is common to use the reprojected 3D edges and reconstruct a bounding box from that. Which one did you use?

johannes-tum commented 2 weeks ago

And one more question: when you speak of multi-view images:

do you mean images that are time-wise adjacent or
do you mean images that have been recorded at the same time with a different camera of the car or
both?

skmhrk1209 commented 1 week ago

Apart from Table 5 you are performing all experiments on KITTI-360. What is the reason for that? Does KITTI-360 provide something that KITTI lacks? E.g., masks?

Yes. The KITTI dataset does not provide masks. Moreover, although the KITTI dataset provides camera poses, they are raw OXTS (GPS/IMU) measurements and therefore not accurate. The KITTI-360 dataset provides more accurate camera poses optimized by SfM.

You are also using 2D bounding boxes for supervision. The original KITTI dataset provides tight 2D bboxes. However, in the domain of weakly supervised monocular 3D object detection I think it is common to use the reprojected 3D edges and reconstruct a bounding box from that. Which one did you use?

We used tight 2D bounding boxes, which can be obtained from instance masks. [code] As you know, the 2D bounding box reprojected from a 3D bounding box does not tightly fit the boundary of the object in the image plane. However, it does not matter, since we also use the proposed multi-view silhouette loss via volume rendering.
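
For illustration, deriving a tight box from an instance mask can be sketched as below. This is a minimal sketch and not the code linked above: the function name `mask_to_box`, the nested-list mask format, and the inclusive `(x_min, y_min, x_max, y_max)` convention are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

# Assumed convention: inclusive pixel indices (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


def mask_to_box(mask: Sequence[Sequence[int]]) -> Optional[Box]:
    """Return the tight box around the nonzero pixels of a binary mask.

    Returns None when the mask contains no foreground pixel.
    """
    # Rows (y indices) that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Columns (x indices) of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


# Toy 5x6 instance mask (hypothetical data).
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(mask_to_box(mask))  # (1, 1, 4, 3)
```

A box built this way touches the object boundary on all four sides by construction, whereas the box enclosing the projected corners of a 3D cuboid generally does not.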

And one more question: when you speak of multi-view images:

We used temporally adjacent images, so stereo cameras are not required. But since the KITTI-360 dataset provides stereo images, we can also use them.
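
As a rough illustration of what "temporally adjacent" could mean in practice, the sketch below picks neighboring frame indices around a target frame within one sequence. The function name `neighbor_frames` and the `radius` and `stride` parameters are hypothetical and chosen only for this example; the actual sampling scheme used in the repository may differ.

```python
from typing import List


def neighbor_frames(target: int, num_frames: int, radius: int = 2, stride: int = 1) -> List[int]:
    """Indices of frames within `radius` steps of `target`, excluding `target` itself.

    `stride` is the gap (in frames) between consecutive steps. Indices that
    fall outside the sequence [0, num_frames) are dropped.
    """
    candidates = (
        target + offset * stride
        for offset in range(-radius, radius + 1)
        if offset != 0
    )
    return [index for index in candidates if 0 <= index < num_frames]


print(neighbor_frames(target=10, num_frames=100))  # [8, 9, 11, 12]
print(neighbor_frames(target=0, num_frames=100))   # [1, 2]
```

All frames selected this way come from a single camera, which is why a stereo rig is not needed.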

skmhrk1209 / VSRD
