kxhit / vMAP

[CVPR 2023] vMAP: Vectorised Object Mapping for Neural Field SLAM
https://kxhit.github.io/vMAP

Test vmap on TUM #8

Closed rwn17 closed 1 year ago

rwn17 commented 1 year ago

Thanks for your excellent work and congratulations on the acceptance! I'm trying to reproduce the results on the TUM dataset. Here is my process:

  1. Run Detic and get the semantic and instance IDs.
  2. Write the dataloader following NICE-SLAM. To ensure consistency with the vMAP loader, I have deleted the pose transform from the camera frame to the NeRF frame (see the sketch below).
  3. Reuse the config file for Replica room0.
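For context, the transform I deleted in step 2 is the axis flip NICE-SLAM applies when converting camera-to-world poses to the NeRF/OpenGL convention. A minimal sketch of my loader change, assuming I read the NICE-SLAM code correctly (names are illustrative):

```python
import numpy as np

def load_tum_pose(c2w: np.ndarray, nerf_convention: bool = False) -> np.ndarray:
    """Return a 4x4 camera-to-world pose for the dataloader.

    NICE-SLAM flips the y and z axes to move from the camera convention
    to the NeRF/OpenGL convention; vMAP expects the pose as-is, so I
    skip the flip (nerf_convention=False).
    """
    c2w = c2w.copy()
    if nerf_convention:  # the NICE-SLAM-style transform that I deleted
        c2w[:3, 1] *= -1
        c2w[:3, 2] *= -1
    return c2w
```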

Despite following these steps, I am still unable to obtain meaningful reconstruction results. I have a couple of questions that I hope you can help me with:

  1. To ensure accurate results, it is important to have consistent instance IDs, right? However, the instance IDs provided by Detic may not be consistent. To overcome this, I have assigned semantic IDs as instance IDs and removed semantic classes with duplicated instances (a sketch of this workaround follows below). Are there any better solutions to this problem?
  2. Could you provide instructions on how to reproduce the TUM results regarding hyperparameters? Alternatively, could you kindly share the config file?
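Concretely, my workaround looks roughly like this (a sketch only; the Detic output is simplified to plain HxW integer maps):

```python
import numpy as np

def remap_instances_to_semantics(inst_map, sem_map):
    """Replace Detic's per-frame instance IDs with semantic class IDs,
    dropping any class that has more than one instance in the frame
    (those IDs would collide and break cross-frame consistency).

    inst_map / sem_map: HxW integer maps, 0 = background.
    """
    out = np.zeros_like(inst_map)
    for inst_id in np.unique(inst_map):
        if inst_id == 0:
            continue
        mask = inst_map == inst_id
        # majority semantic class under this instance mask
        sem_id = int(np.bincount(sem_map[mask]).argmax())
        others = np.unique(inst_map[sem_map == sem_id])
        if len(others[others != 0]) > 1:
            continue  # duplicated class in this frame: drop it
        out[mask] = sem_id
    return out
```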

Thank you in advance for your help.

kxhit commented 1 year ago

Hi @rwn17 , thanks for your interest in our work!

  1. Yeah, consistent instance IDs are expected to ensure quality. In practice, we use depth info and semantics to check the overlap between frames and propagate the IDs (conceptually like the sketch after this list). Recently I found that some video segmentation methods work better than Detic in terms of consistency, e.g., XMem. Could you check whether the instance masks for the objects are consistent and well-segmented?
  2. The hyperparameters are the same for TUM; just make sure the intrinsics are correct, and note that the depth range is smaller in TUM, e.g., [0.0, 6.0]. I would suggest first running iMAP mode to check that the camera pose and intrinsics work well.
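Conceptually, the propagation is just greedy mask matching between consecutive frames; a minimal sketch of the idea (not our exact code, which also checks depth consistency and the semantic class before accepting a match):

```python
import numpy as np

def propagate_ids(prev_map, cur_map, iou_thresh=0.5):
    """Relabel the current frame's instance map so each mask inherits
    the previous-frame ID it overlaps most, if the IoU is high enough;
    unmatched masks spawn new IDs. prev_map / cur_map: HxW int, 0 = bg.
    """
    out = np.zeros_like(cur_map)
    next_id = int(prev_map.max()) + 1
    for cur_id in np.unique(cur_map):
        if cur_id == 0:
            continue
        cur_mask = cur_map == cur_id
        best_id, best_iou = 0, 0.0
        for prev_id in np.unique(prev_map[cur_mask]):  # overlapping IDs only
            if prev_id == 0:
                continue
            prev_mask = prev_map == prev_id
            iou = (cur_mask & prev_mask).sum() / (cur_mask | prev_mask).sum()
            if iou > best_iou:
                best_id, best_iou = int(prev_id), iou
        if best_iou >= iou_thresh:
            out[cur_mask] = best_id
        else:
            out[cur_mask] = next_id  # no good match: start a new track
            next_id += 1
    return out
```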

Please let me know if it still doesn't work. Thanks!

rwn17 commented 1 year ago

Hello @kxhit , thank you for your prompt response! I have discovered that the noisy background has caused me to miss the foreground in my reconstruction. As a result, my current vMAP reconstruction appears as in the attached figure (Screenshot from 2023-03-31 14-57-59).

Do you have any suggestions on how to improve the accuracy of the reconstruction? Thank you in advance!

kxhit commented 1 year ago

Hi @rwn17 , the foreground object reconstruction looks good to me; the background (bg) is very noisy.

  1. Could you visualise only some foreground objects, just to verify they are reconstructed well? I suspect the noise comes from duplicated instances that are segmented/observed in only very few views (see the sketch after this list for one way to filter these).
  2. For the bg, could you check that the bg mask is consistent? Also, setting do_bg on will initialise a slightly bigger model and sample more points for bg reconstruction. I suspect the depth info for the bg is mostly invalid, which could also cause the low quality, as we use a depth-guided sampling strategy with a simple MLP and train it online.
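For point 1, one quick way to find the suspicious objects is to count how many frames each instance ID is observed in with a reasonably large mask, and only mesh the well-observed ones. A rough sketch (thresholds are arbitrary examples, not values from our code):

```python
from collections import Counter
import numpy as np

def well_observed_ids(inst_maps, min_views=10, min_pixels=500):
    """Return the instance IDs seen in at least `min_views` frames with
    at least `min_pixels` mask pixels; IDs failing this are the likely
    source of noisy duplicated objects. inst_maps: list of HxW int maps.
    """
    views = Counter()
    for m in inst_maps:
        ids, counts = np.unique(m, return_counts=True)
        for i, c in zip(ids, counts):
            if i != 0 and c >= min_pixels:
                views[int(i)] += 1
    return {i for i, v in views.items() if v >= min_views}
```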

Overall, I think the gap mainly comes from the inconsistent masks or the large portion of invalid depth. Hope it helps!

rwn17 commented 1 year ago

Hi @kxhit . I checked the meshes of the individual objects. For those with a consistent mask and ID (monitor, keyboard), the reconstruction looks good. The noisy parts mostly come from inconsistent and jittering masks (book, ground). I will try a better segmentation model later. Thanks for your kind suggestions!

kxhit commented 1 year ago

Yeah, data association and consistent mask tracking are always challenges in the real world. A better front-end, e.g., video segmentation, will definitely improve the performance. In the meantime, finding a global constraint that forces the individual object models to compose a complete 3D scene would be best. Using the 3D map to somehow feed back into the segmentation is also interesting.

idra79haza commented 1 year ago

@rwn17 Hi! I know your questions are finished, but I just wanted to ask you about the procedure you mentioned above, which is

  1. Run Detic and get the semantic and instance IDs
  2. Write the dataloader following NICE-SLAM. To ensure consistency with the vMAP loader, I have deleted the pose transform from the camera frame to the NeRF frame.

Regarding the first question, I was wondering what kind of images are needed, since there are "semanticclass.png" files, "vis_semanticclass.png" files, and "semanticinstance*.png" files in the pre-given iMAP data. I also wanted to ask how you got the semantic and instance IDs, since when you use Detic with the demo file, the outcome is the segmented image itself, not the instance IDs.

And lastly, regarding the second step: why did you delete the pose transform from the camera frame to the NeRF frame in the NICE-SLAM code when implementing it here?

Thank you so much!

rwn17 commented 1 year ago

Hi @idra79haza , for the first question, I hacked Detic a little bit to extract the per-pixel instance and semantic IDs. I'm not sure whether there is a better solution. For the second question, I noticed that vMAP has no transformation like the NICE-SLAM pose transform, so I just deleted it and it works. Hope it helps.
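Roughly, since Detic's demo is built on detectron2, its predictor already returns an `Instances` object with `pred_masks`, `pred_classes`, and `scores`; I just rasterised those into ID maps. A sketch of the idea (not my exact code):

```python
import numpy as np

def detic_to_id_maps(instances):
    """Rasterise a detectron2 `Instances` object (as returned by the
    Detic demo predictor) into per-pixel instance and semantic ID maps.
    0 is background; overlapping masks are painted in order of
    increasing score so higher-confidence detections win.
    """
    h, w = instances.image_size
    inst_map = np.zeros((h, w), dtype=np.int32)
    sem_map = np.zeros((h, w), dtype=np.int32)
    for inst_id, i in enumerate(instances.scores.argsort().tolist(), start=1):
        mask = instances.pred_masks[i].cpu().numpy()
        inst_map[mask] = inst_id
        sem_map[mask] = int(instances.pred_classes[i]) + 1  # 0 stays background
    return inst_map, sem_map
```

Then something like `inst_map, sem_map = detic_to_id_maps(outputs["instances"])` on the demo predictor's output gives the two per-pixel maps to save per frame.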

bilibilijin commented 1 year ago

> Hi, for the first question, I hacked Detic a little bit to extract the per-pixel instance and semantic IDs. I'm not sure whether there is a better solution. For the second question, I noticed that in vMAP there is no transformation like the NICE-SLAM pose transform. So I just deleted it and it works. Hope it helps.

Hi! Regarding extracting the per-pixel instance and semantic IDs with Detic, what I want to ask is which part of Detic can be modified to make it output these results. If you could give me some hints, I would be very grateful! Looking forward to your reply!