v-pnk / cadloc

Benchmark for visual localization on imperfect 3D mesh models from the Internet
https://v-pnk.github.io/cadloc/
BSD 3-Clause "New" or "Revised" License

MeshLoc and Visual Localization using Imperfect 3D Models from the Internet, common questions #2

Closed: LSK0821 closed this issue 1 month ago

LSK0821 commented 11 months ago

Hello dear author! These two works are excellent and very inspiring for researchers working on visual localization! I would like to ask two questions.

Firstly, the 12 Scenes dataset is used as experimental data in MeshLoc. This dataset uses an RGB-D SLAM method to reconstruct a 3D model of the scene, and MeshLoc uses depth maps rendered from that model to recover the correspondences between features in the query image and 3D coordinates in the world coordinate system. So can I understand that the textures on the 3D model don't work? Is it the geometric accuracy of the 3D model (e.g., point cloud depth accuracy and mesh construction method), together with the rendering method, that plays a role?

Secondly, in the article Visual Localization using Imperfect 3D Models from the Internet, CAD models are used, which is a very novel idea! Have the authors ever thought of using NeRF as a renderer for image retrieval, using imperfect models (e.g., CAD models) to obtain a coarse localization result, and then inverting this NeRF network with fixed parameters to optimize the coarse result? Looking forward to your reply! Thanks very much!

v-pnk commented 11 months ago

Hi!

> So can I understand that the textures on the 3D model don't work?

We also show the results for rendered database images in the supplementary (see page 27 of the arXiv version). They are a bit worse than the results for real database images. Yes, both the accuracy of the 3D model and the rendering method influence the results. From the experiments on the Aachen dataset (see pages 25 and 26), we observed that the quality of the geometry is not as significant as the level of detail in the model colors (per-vertex colors vs. textures).

> Have the authors ever thought of using NeRF as a renderer for image retrieval, using imperfect models (e.g., CAD models) to obtain a coarse localization result, and then inverting this NeRF network with fixed parameters to optimize the coarse result?

Yes, using NeRFs is an obvious next step: the MeshLoc pipeline can work with any renderable representation, and it would probably benefit from the ability of NeRFs to render photo-realistic images. We did not do any work in the direction of NeRF inversion. We will be happy to see how well it works if you publish such a method!
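For concreteness, the inversion idea in the question is essentially iNeRF-style pose refinement: keep the NeRF weights fixed and optimize the camera pose by photometric error against the query image. Below is a minimal PyTorch sketch, assuming a pretrained NeRF exposed as a differentiable `nerf_render(pose) -> image` function (a hypothetical interface; nothing like this ships with MeshLoc or cadloc):

```python
import torch

def skew(v):
    """3x3 skew-symmetric matrix of a 3-vector (differentiable)."""
    z = torch.zeros((), dtype=v.dtype)
    return torch.stack([
        torch.stack([z, -v[2], v[1]]),
        torch.stack([v[2], z, -v[0]]),
        torch.stack([-v[1], v[0], z]),
    ])

def apply_se3_delta(pose, delta):
    """Left-compose a small SE(3) update (axis-angle + translation)
    with a 4x4 camera pose."""
    w, t = delta[:3], delta[3:]
    theta = torch.linalg.norm(w) + 1e-12
    K = skew(w)
    # Rodrigues' formula with the unnormalized skew matrix K
    R = (torch.eye(3, dtype=pose.dtype)
         + torch.sin(theta) / theta * K
         + (1.0 - torch.cos(theta)) / theta**2 * (K @ K))
    T = torch.eye(4, dtype=pose.dtype)
    T[:3, :3] = R
    T[:3, 3] = t
    return T @ pose

def refine_pose(nerf_render, query_image, pose_init, steps=200, lr=1e-2):
    """iNeRF-style refinement: NeRF weights stay fixed; only a 6-DoF
    pose update on top of the coarse CAD-model localization result is
    optimized by photometric error against the query image."""
    delta = torch.zeros(6, dtype=pose_init.dtype, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pose = apply_se3_delta(pose_init, delta)
        loss = torch.nn.functional.mse_loss(nerf_render(pose), query_image)
        loss.backward()
        opt.step()
    return apply_se3_delta(pose_init, delta.detach())
```

In practice the photometric loss is usually evaluated on a subset of sampled rays rather than full renders to keep each step cheap, which is the main trick iNeRF uses.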

LSK0821 commented 11 months ago

Hello, thank you for your reply, I am very pleased!

I'm very sorry for the confusion caused by my description of the problem! After carefully reading MeshLoc, my understanding is that, given database images with known poses, the only role of the dense model is to render depth maps, which are used to recover the 3D coordinates of the query image features. Is this understanding correct?

Because I focus on mesh reconstruction and optimization in MVS, I also noticed that the section "Localization using multi-modal data" of Visual Localization using Imperfect 3D Models from the Internet mentions: "The latter line of work has shown that modern local features can match real photos against non-photorealistic renderings of colored meshes or even against meshes without any color [66, 100, 15]."

So I would like to ask: have you ever used depth maps rendered from a dense model with high-precision geometry but without texture mapping for visual localization? How accurate is the localization?

v-pnk commented 10 months ago

MeshLoc can operate in two modes:

- using real database images with known poses, or
- using synthetic images rendered from the 3D model.

In both cases the rendered depth maps are used to lift the 2D-2D matches to 2D-3D matches, as sketched below.
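To make the lifting step concrete, here is a minimal numpy sketch, assuming a pinhole intrinsics matrix `K` and a camera-to-world pose for the database (or rendered) view; the function name and interface are illustrative, not MeshLoc's actual API:

```python
import numpy as np

def lift_matches_to_3d(kpts_db, depth_map, K, cam_to_world):
    """Lift 2D keypoints in a database/rendered view to 3D world points
    using the depth map rendered from the mesh.

    kpts_db: (N, 2) pixel coordinates (x, y) in the database image
    depth_map: (H, W) metric depth rendered from the 3D model
    K: (3, 3) pinhole intrinsics of the database view
    cam_to_world: (4, 4) camera-to-world pose of the database view
    """
    x, y = kpts_db[:, 0], kpts_db[:, 1]
    # Nearest-neighbor depth lookup; a real pipeline may interpolate
    # and reject keypoints that fall on depth discontinuities.
    z = depth_map[y.round().astype(int), x.round().astype(int)]
    valid = z > 0  # zero depth = no geometry rendered at that pixel
    # Unproject to camera coordinates: X_cam = z * K^-1 [x, y, 1]^T
    pix_h = np.stack([x, y, np.ones_like(x)], axis=1)
    pts_cam = (np.linalg.inv(K) @ pix_h.T).T * z[:, None]
    # Transform to world coordinates
    pts_h = np.concatenate([pts_cam, np.ones((len(z), 1))], axis=1)
    pts_world = (cam_to_world @ pts_h.T).T[:, :3]
    return pts_world, valid
```

The query keypoints matched to these database keypoints then form 2D-3D correspondences that a standard PnP + RANSAC solver (e.g., `cv2.solvePnPRansac`) turns into a camera pose estimate.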

If you are asking whether we tried to match directly between RGB queries and metric depth maps: no, we did not try that. We always converted the geometry to RGB space somehow, either by coloring the mesh (which has no prior color information) with ambient occlusion (AO), or by rendering the uncolored mesh with shading. You can find examples of both in Fig. 3 of the MeshLoc paper.
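For illustration, baking such an AO coloring is a mesh preprocessing step along the following lines; this is a deliberately simple sketch using trimesh's ray casting, not the exact procedure from the paper:

```python
import numpy as np
import trimesh

def bake_vertex_ao(mesh: trimesh.Trimesh, n_rays: int = 64, eps: float = 1e-4):
    """Bake per-vertex ambient occlusion: for each vertex, shoot random
    rays into the hemisphere around its normal and record the fraction
    that escape the mesh. Slow (per-vertex loop) but shows the idea."""
    rng = np.random.default_rng(0)
    ao = np.zeros(len(mesh.vertices))
    for i, (v, n) in enumerate(zip(mesh.vertices, mesh.vertex_normals)):
        dirs = rng.normal(size=(n_rays, 3))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        dirs[dirs @ n < 0] *= -1.0  # flip samples into the normal hemisphere
        origins = np.tile(v + eps * n, (n_rays, 1))  # offset to avoid self-hits
        hits = mesh.ray.intersects_any(origins, dirs)
        ao[i] = 1.0 - hits.mean()  # 1 = fully open, 0 = fully occluded
    return ao
```

The resulting per-vertex values can be stored as grayscale vertex colors and the mesh then rendered like any other colored mesh.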

AFAIK the ambient occlusion rendering needs a preprocessing step on the mesh (to generate the ambient occlusion coloring), so it cannot be generated directly from the depth maps. The second method is based on a lighting setup that moves with the camera, so it can be created purely from geometry at rendering time. If you did not want to render RGB images but only depth maps, the conversion from depth maps to RGB could be moved to the time of local feature extraction and matching.
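If the conversion were moved to feature-extraction time, one plausible version (an assumption on my side, not the paper's exact shading) is to estimate normals from the back-projected depth map and shade them with a light co-located with the camera:

```python
import numpy as np

def shade_depth_map(depth_map, K):
    """Convert a rendered metric depth map to a shaded grayscale image.
    Normals are estimated from back-projected points; the light sits at
    the camera, so the shading moves with the viewpoint."""
    H, W = depth_map.shape
    # Back-project every pixel to camera-space 3D points
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    pts = (pix @ np.linalg.inv(K).T) * depth_map[..., None]
    # Surface normals from finite differences of neighboring 3D points
    dx = np.gradient(pts, axis=1)
    dy = np.gradient(pts, axis=0)
    n = np.cross(dx, dy)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    # Lambertian shading with the light direction along the viewing ray
    view = pts / (np.linalg.norm(pts, axis=-1, keepdims=True) + 1e-12)
    shading = np.abs((n * view).sum(axis=-1))
    shading[depth_map <= 0] = 0.0  # background pixels stay black
    return (255 * shading).astype(np.uint8)
```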

The results can be found in Tab. 4 of the MeshLoc paper and also in Tab. 8 in the supplementary.