some confusion about data preparation

skmhrk1209 / VSRD

The official implementation of "VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection" [CVPR 2024]

https://arxiv.org/abs/2404.00149

MIT License

some confusion about data preparation #2

Closed qgq99 closed 4 months ago

qgq99 commented 5 months ago

First of all, thank you very much for your excellent work! I am a bit confused about the dataset preparation. The paper says:

We use the KITTI-360 dataset for our experiments, splitting it into training (43,855 images), validation (1,173 images), and test sets (2,531 images).

So the total number of images used in the experiments is 47,559 (fewer than 50,000), but the whole KITTI-360 dataset contains far more than that (about 150,000 images in the "data_2d_raw" directory).

Because the dataset is so large, it is difficult for me to train the model with our limited hardware. Is it really necessary to download the whole KITTI-360 dataset? If not, could you please provide more precise instructions for preparing the dataset?

I would appreciate it very much! 😁😁

skmhrk1209 commented 5 months ago

Thank you for your comments! I've just added a detailed description of data preparation to README.md. Frames without camera poses or instance masks are excluded during data preparation. Moreover, target frames that do not have at least 16 source frames are excluded during frame sampling. As you mentioned, the number of frames that can be used for our auto-labeling is smaller than the number of frames in the KITTI-360 dataset for the above reasons. However, our code requires the whole KITTI-360 dataset. If you still have any questions, feel free to comment on this thread!
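
For readers who want a concrete picture of the two exclusion rules described above, here is a minimal sketch. It is not the repository's code: the `Frame` record, its fields, and the functions `usable_frames` and `select_target_frames` are hypothetical names made up for illustration. In particular, treating "source frames" as other usable frames within a fixed index window of the target is an assumption; the actual sampling rule is the one documented in README.md.

```python
from dataclasses import dataclass

MIN_SOURCE_FRAMES = 16  # threshold mentioned in the comment above


@dataclass(frozen=True)
class Frame:
    """Hypothetical per-frame record; not a type from the VSRD repository."""
    index: int
    has_camera_pose: bool
    has_instance_mask: bool


def usable_frames(frames):
    """Drop frames that lack a camera pose or an instance mask."""
    return [f for f in frames if f.has_camera_pose and f.has_instance_mask]


def select_target_frames(frames, window=16, min_sources=MIN_SOURCE_FRAMES):
    """Keep target frames that have at least `min_sources` source frames.

    ASSUMPTION: a source frame is any other usable frame whose index lies
    within `window` of the target's index. The real rule may differ.
    """
    usable = usable_frames(frames)
    targets = []
    for target in usable:
        sources = [
            f for f in usable
            if f.index != target.index
            and abs(f.index - target.index) <= window
        ]
        if len(sources) >= min_sources:
            targets.append(target)
    return targets
```

Under these assumptions, a short sequence in which only a few frames have both a pose and a mask yields no target frames at all, which is consistent with the usable frame count being much smaller than the raw image count.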

qgq99 commented 5 months ago

Thank you very much! I have finished preparing the dataset. Now I have a couple of new questions:

1. Since it can generate 3D bounding boxes, could VSRD serve as an independent 3D object detector?
2. After training VSRD, can I use it to generate pseudo labels for my custom unlabeled dataset?

🤗

skmhrk1209 commented 5 months ago

No, it doesn't serve as an independent 3D object detector for the following reasons:
- Instance masks and camera poses are required.
- It takes about 15 minutes to optimize 2D bounding boxes for each frame on a V100.

No, VSRD also cannot generate pseudo labels for an unlabeled dataset, since it always requires 2D instance masks and camera poses.