tsattler / meshloc_release

BSD 3-Clause "New" or "Revised" License

Questions about Depth Map in MeshLoc #12

Open LSK0821 opened 1 month ago

LSK0821 commented 1 month ago

Dear tsattler and v-pnk, First, I would like to express my admiration for your impressive work on the MeshLoc project. As a researcher working on visual relocalization, I find your project to be incredibly insightful and valuable.

After reading the MeshLoc code and papers, I have some questions.

Question 1: MeshLoc mentions that 3D meshes are generated using the SPSR reconstruction algorithm based on the dense point clouds from MVS. Since SPSR tends to produce smooth surfaces, this may result in the loss of original surface details. Is it accurate to use the depth map rendered by OpenGL to recover 3D coordinates under these circumstances?

Question 2: Even the best 3D mesh reconstruction algorithms face challenges related to geometric accuracy, such as holes and distortions. When I opened the dataset /2022MeshLoc/aachen_day_night_v11/meshes/AC13_colored.ply in MeshLab, I observed many holes and distorted areas. In these regions, using OpenGL-rendered depth maps introduces significant errors when recovering 3D coordinates. How should visual localization handle such issues?

Question 3: In contrast, sparse SfM point clouds inherently provide 3D coordinates, avoiding potential errors introduced by mesh reconstruction and depth map rendering. However, in some datasets, the localization accuracy of sparse point clouds is even lower. What might be the reason for this lower accuracy?

Thank you so much for your time and for sharing such important work with the community. I truly appreciate any assistance you can provide.

v-pnk commented 1 month ago

Hi @LSK0821 ,

Question 1: Naturally, SPSR introduces some errors. We ablated these errors with respect to the octree depth used in SPSR. The number in the mesh name (e.g., AC13) marks the octree depth used in SPSR; a larger octree depth results in a more detailed mesh. You can find the parameters of the generated meshes in Tab. 1. The ablation in Tab. 3 uses real images and rendered depth maps, so the differences between the columns in the table come purely from the different SPSR octree depths. We did not do any ablation using other mesh reconstruction algorithms or changing other parameters of SPSR.
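For illustration, here is a minimal sketch (not the actual MeshLoc code) of how a rendered depth map can be used to recover 3D coordinates for 2D keypoints. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a depth map that already stores metric depth along the camera z-axis (i.e., the OpenGL z-buffer linearized), and a world-to-camera pose `(R, t)` with `x_cam = R @ x_world + t`; all names are hypothetical.

```python
import numpy as np

def lift_keypoints(kpts, depth_map, fx, fy, cx, cy, R, t):
    """Back-project 2D keypoints (N, 2, pixel coords) to world-space 3D points."""
    u = kpts[:, 0]
    v = kpts[:, 1]
    # Nearest-neighbor lookup of the rendered depth; interpolation is also possible.
    d = depth_map[np.round(v).astype(int), np.round(u).astype(int)]
    # Pixels falling into mesh holes / background are assumed to carry zero depth here.
    valid = d > 0.0
    # Pinhole back-projection into the camera frame.
    x_cam = np.stack([(u - cx) / fx * d, (v - cy) / fy * d, d], axis=1)
    # Camera frame -> world frame: x_world = R^T (x_cam - t), vectorized over rows.
    x_world = (x_cam - t) @ R
    return x_world, valid
```

The accuracy of the resulting 3D points is then limited by the depth map, i.e., by the smoothing SPSR applies at the chosen octree depth.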

Question 2: The holes in the mesh result in some 2D-2D correspondences not being lifted to 2D-3D correspondences, so those are not used for localization. The distorted areas can lead to inaccurate 2D-3D correspondences. We assume that large parts of the depth maps are still valid, resulting in many consistent and accurate 2D-3D correspondences. The inaccurate correspondences are then filtered out by the pose estimation RANSAC, as it picks the minimal sample with the most inliers (i.e., the consistent and accurate 2D-3D correspondences).
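To make that concrete, a rough sketch of the filtering described above, reusing the `lift_keypoints` sketch from the previous comment. It uses OpenCV's PnP RANSAC purely for illustration (MeshLoc uses its own pose estimation back end); `query_kpts` and `db_kpts` are assumed to be aligned arrays of matched 2D-2D keypoints, and `query_K`/`db_K` 3x3 intrinsic matrices.

```python
import cv2
import numpy as np

def estimate_pose(query_kpts, db_kpts, db_depth, db_K, db_R, db_t, query_K):
    fx, fy, cx, cy = db_K[0, 0], db_K[1, 1], db_K[0, 2], db_K[1, 2]
    # Lift the database side of each 2D-2D match to 3D using the rendered depth.
    pts3d, valid = lift_keypoints(db_kpts, db_depth, fx, fy, cx, cy, db_R, db_t)
    # Matches falling into mesh holes (invalid depth) are dropped here,
    # so they never become 2D-3D correspondences.
    pts3d = pts3d[valid]
    pts2d = query_kpts[valid]
    # RANSAC keeps the pose supported by the most consistent correspondences;
    # correspondences distorted by mesh errors end up as outliers.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64),
        query_K.astype(np.float64), None,
        reprojectionError=8.0, iterationsCount=10000)
    return ok, rvec, tvec, inliers
```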

Question 3: Both pipelines come with their own sets of error sources. SfM point clouds can contain inaccuracies due to errors in local feature extraction and matching. The accuracy of the mesh is influenced by errors in the estimated camera parameters (in our case coming from SfM) and by the errors introduced during MVS and SPSR. It is not surprising to me that these different error sources can, in some cases, result in better localization accuracy when using a mesh.

LSK0821 commented 1 month ago

Thank you very much for your explanation! I still have some points of confusion as follows:

  1. How is the scale information of the mesh determined in the 12Scenes and aachen_day_night_v11 datasets? As we know, monocular cameras do not provide scale information.
  2. I couldn’t find the ground truth for the poses of aachen_day_night_v11 in the provided links. Are the poses obtained through SfM used as the ground truth?

v-pnk commented 1 month ago

  1. The scales of the meshes are the same as the scales of the reference SfM models from which the meshes were generated. The Aachen dataset SfM model was aligned to OpenStreetMap (see Image Retrieval for Image-Based Localization Revisited), and the 12 Scenes dataset was captured with an RGB-D camera, so the data is already in metric scale (see Learning to Navigate the Energy Landscape).
  2. The ground truth poses for the reference images are provided in the Aachen Day-Night v1.1 data repository. The ground truth for the query images is hidden, but your estimates can be evaluated using The Visual Localization Benchmark.
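
In case it helps, a small sketch of writing pose estimates in the plain-text format I believe the benchmark expects for Aachen Day-Night (one line per query: image name, world-to-camera rotation as a quaternion qw qx qy qz, translation tx ty tz). Please double-check the current submission instructions on the benchmark page, as this format is an assumption on my part; the `write_submission` helper and its arguments are hypothetical.

```python
def write_submission(path, poses):
    """poses: dict mapping query image name -> (qvec of shape (4,), tvec of shape (3,))."""
    with open(path, "w") as f:
        for name, (qvec, tvec) in poses.items():
            qw, qx, qy, qz = qvec
            tx, ty, tz = tvec
            f.write(f"{name} {qw} {qx} {qy} {qz} {tx} {ty} {tz}\n")
```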