ttaoREtw / ImGeoNet

[ICCV 2023] ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection
https://ttaoretw.github.io/imgeonet/

Inquiries about training on ARKitScenes #4

Closed Cindy0725 closed 1 month ago

Cindy0725 commented 4 months ago

Hi, I want to ask about the training process on the ARKitScenes dataset. How many images are you using for training? I see you use 50 views for testing and get a very high result. I am also wondering about the meaning of the following sentence in the paper: [screenshot of the sentence from the paper] Does it mean the volume origin is set to the camera position in world coordinates for each scene? Could you please provide the code for training on the ARKitScenes dataset?

Thank you very much!

ttaoREtw commented 4 months ago

Yes, we shift the volume origin to the mean center of the camera locations of each scene. We plan to release the code in July/August. Stay tuned 😊
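
For readers reproducing this, a minimal sketch of what that shift might look like, assuming camera-to-world extrinsics are available (the helper name and volume size below are illustrative, not taken from the released code):

```python
import numpy as np

# Hypothetical helper: place the voxel volume so its center sits at the mean
# of the per-scene camera locations, as described in the comment above.
def shift_volume_origin(extrinsics: np.ndarray, volume_size: np.ndarray) -> np.ndarray:
    """extrinsics: (N, 4, 4) camera-to-world matrices; volume_size: (3,) in meters.
    Returns the volume origin (minimum corner) in world coordinates."""
    cam_centers = extrinsics[:, :3, 3]        # camera positions in world coordinates
    mean_center = cam_centers.mean(axis=0)    # mean center of all camera locations
    return mean_center - volume_size / 2.0    # origin so the volume is centered there

# Example usage with an illustrative (not paper-specified) volume size:
# origin = shift_volume_origin(extrinsics, np.array([6.4, 6.4, 2.56]))
```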

Cindy0725 commented 4 months ago

> Yes, we shift the volume origin to the mean center of the camera locations of each scene. We plan to release the code in July. Stay tuned 😊

Okay, I will wait for the code release, thanks for the reply! But could you first share some tricks for training on the ARKitScenes dataset, especially for the ImVoxelNet baseline? I also reproduced ImVoxelNet on ARKitScenes using their published code, but I can only get around 0.3 mAP@0.25 with 50 views (50 views for both training and testing), which is much lower than the result in your paper. I am still struggling with the training tricks. Another question: when you generate the ground-truth targets from the ground-truth depth maps, how many depth maps did you use? And did you use the same number of views for training and testing (e.g., 50 views for training and 50 views for testing)?
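
On the geometry targets asked about above, one plausible way (an assumption for illustration, not the authors' confirmed procedure) to build a voxel occupancy target from ground-truth depth maps is to project voxel centers into each depth map and mark voxels that fall near the observed surface; the function name, signature, and surface-band threshold below are illustrative only:

```python
import numpy as np

def occupancy_from_depths(voxel_centers, depth_maps, intrinsics, world2cams, band=0.05):
    """voxel_centers: (V, 3) world coords; depth_maps: list of (H, W) arrays;
    intrinsics: (3, 3); world2cams: list of (4, 4) world-to-camera matrices;
    band: half-width (meters) of the surface band counted as occupied."""
    occ = np.zeros(len(voxel_centers), dtype=bool)
    homog = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    for depth, w2c in zip(depth_maps, world2cams):
        cam = (w2c @ homog.T).T[:, :3]                 # world -> camera coordinates
        z = np.maximum(cam[:, 2], 1e-6)                # clamp to avoid division by zero
        u = intrinsics[0, 0] * cam[:, 0] / z + intrinsics[0, 2]
        v = intrinsics[1, 1] * cam[:, 1] / z + intrinsics[1, 2]
        h, w = depth.shape
        valid = (cam[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        d = np.zeros(len(z))
        d[valid] = depth[v[valid].astype(int), u[valid].astype(int)]
        occ |= valid & (np.abs(cam[:, 2] - d) < band)  # voxel lies near the observed surface
    return occ
```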

ttaoREtw commented 1 month ago

Please refer to this and this. It is consistent with the paper.