borongyuan opened this issue 6 months ago
Sure, it depends on what the end goal is. From a robotics point of view, one major issue with OctoMap is not really the update step, but that on loop closure detection (when the map's graph is optimized), we need to re-generate the whole OctoMap from scratch. I tried some TSDF approaches in the past (open_chisel, which was originally created for Google Tango, and cpu_tsdf), but I always got stuck on this question: "How to efficiently update the 3D reconstruction online to match the new optimized graph after a loop closure?". Because of that question, these 3D reconstruction approaches are mostly used only offline with RTAB-Map (options available when doing File->Export Clouds...). If re-generation with OpenVDB is fast (e.g., faster than OctoMap), it could indeed be useful to integrate. Note that if we only want to use a 3D reconstruction in localization mode (knowing that we won't update the global map), it could indeed be useful for indoor 3D route planning (e.g., for a drone). Approaches derived from ElasticFusion may have some answers to the question above; it showed some results for deforming a TSDF after loop closure detection, but I am not sure it scales well (and it requires a quite good computer/GPU to run).
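To make the cost concrete, here is a minimal sketch of that full re-generation with the octomap API (`scanFor` is a hypothetical helper standing in for whatever returns the raw scan stored with each graph node):

```cpp
#include <octomap/octomap.h>
#include <map>

// Hypothetical lookup: fetch the scan (in sensor frame) stored with a graph node.
octomap::Pointcloud scanFor(int nodeId);

void regenerateMap(octomap::OcTree& tree,
                   const std::map<int, octomap::pose6d>& optimizedPoses)
{
    tree.clear();  // the whole map is thrown away...
    for (const auto& [nodeId, pose] : optimizedPoses) {
        octomap::Pointcloud scan = scanFor(nodeId);
        scan.transform(pose);                       // re-project with the optimized pose
        tree.insertPointCloud(scan, pose.trans());  // ...and every point is ray-cast again
    }
    tree.updateInnerOccupancy();
}
```

The work here grows with the size of the whole map, not with the size of the correction, which is exactly why doing it on every loop closure hurts.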
Thanks for the link to STVL; getting a speed boost over the default costmap voxel layer is very interesting.
> For example, returning to a visited location from the opposite direction has always been a problem for appearance-based loop closure detection.
Yes, it is. New approaches like SuperGlue can match two images of the same thing taken from very different points of view. However, it could also be interesting to have a photo-realistic 3D reconstruction in which we could simulate an image (a combination of multiple images of the area, taken from points of view different than the robot's current position) to compare with the robot's actual point of view. NeRF volumes can be very realistic; assuming the robot knows roughly where it is in the environment, it could localize by generating an image from the model at that position. While I see great potential for robot localization, I don't yet see how to use NeRF online in SLAM mode.
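As a sketch of the render-and-compare idea (everything here is hypothetical, in particular `renderFromModel`, which stands in for whatever renders a synthetic view of the learned model at a given pose):

```cpp
#include <Eigen/Geometry>
#include <vector>
#include <limits>
#include <cstddef>

using Image = std::vector<float>;  // grayscale pixels, row-major

// Hypothetical: render a synthetic view of the reconstruction at a pose.
Image renderFromModel(const Eigen::Isometry3f& pose);

// Sum of squared per-pixel differences between two same-sized images.
float photometricError(const Image& a, const Image& b)
{
    float err = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i)
        err += (a[i] - b[i]) * (a[i] - b[i]);
    return err;
}

// Pick the candidate pose whose rendering best matches the camera image.
Eigen::Isometry3f localize(const Image& observed,
                           const std::vector<Eigen::Isometry3f>& candidates)
{
    float best = std::numeric_limits<float>::max();
    Eigen::Isometry3f bestPose = Eigen::Isometry3f::Identity();
    for (const auto& pose : candidates) {
        const float err = photometricError(observed, renderFromModel(pose));
        if (err < best) { best = err; bestPose = pose; }
    }
    return bestPose;
}
```

A real system would refine the pose with gradient-based optimization rather than brute-force candidates, but the comparison loop is the core of the idea.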
> How to efficiently update the 3D reconstruction online to match the new optimized graph after a loop closure?
This is also something I have been thinking about. OpenVDB is often used for simulating and rendering sparse volumetric data such as water, fire, smoke, and clouds, so it can certainly be used for fast re-generation. But VDBFusion can't do it yet; its current API only allows adding data. I am not familiar enough with OpenVDB yet, so integrating VDBFusion first is a good starting point, and we'll see how to improve it later. What I know about OpenVDB is that it is well suited to data that is globally sparse and locally dense, and SLAM data happens to have exactly those characteristics.
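As a small illustration of that sparsity in the OpenVDB C++ API (the grid name and values are arbitrary), only the regions we actually write get allocated:

```cpp
#include <openvdb/openvdb.h>
#include <iostream>

int main()
{
    openvdb::initialize();

    // Background value 0.0f: unwritten space costs (almost) no memory.
    openvdb::FloatGrid::Ptr grid = openvdb::FloatGrid::create(0.0f);
    grid->setName("tsdf");

    openvdb::FloatGrid::Accessor acc = grid->getAccessor();
    acc.setValue(openvdb::Coord(100, 200, 300), -0.04f);        // write one voxel
    acc.setActiveState(openvdb::Coord(100, 200, 300), false);   // cheap "delete": mark it inactive

    // Iteration only visits active voxels, not the whole bounding box.
    for (auto iter = grid->cbeginValueOn(); iter; ++iter)
        std::cout << iter.getCoord() << " = " << *iter << std::endl;
    return 0;
}
```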
Another question I have been thinking about is what kind of representation is most suitable for SLAM maps. Different types of environment representation seem to be needed for map visualization, robot navigation, and human-robot interaction. Is it possible to design a universal intermediate representation that can be quickly converted to the other types?

When I was working on local/global descriptors, I realized that these models actually provide sparse image embeddings (only for keypoints). So it naturally occurred to me that we should try dense image embeddings (for every pixel) next. Perhaps SAM is a foundation model worth trying. This seems to imply a path towards semantic SLAM. In the past, people have focused on label-based semantics. However, labels have semantic-granularity issues and bring ambiguity: they cannot describe objects, parts, and subparts well. At the embedding level, there is no such problem, and fusion can be performed on multi-frame embeddings (see the sketch below).

That's why I now feel that semantic SLAM and 3D reconstruction should be considered together. I hope to build an intermediate representation that can contain information such as color, shape, semantics, etc., and can be updated incrementally. Many models now have an Encoder-Decoder structure; a lightweight decoder can convert the intermediate representation into the required output format. So I'm going to try to change it to Encoder->SLAM/Fusion->Decoder.
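Here is a sketch of what fusing multi-frame embeddings into a voxel map could look like (the 32-dim feature size, the packed voxel key, and `fuseEmbedding` are all assumptions for illustration, not an existing API):

```cpp
#include <Eigen/Core>
#include <unordered_map>
#include <cstdint>

// Per-voxel state: a feature vector plus how many frames contributed to it.
struct VoxelFeature {
    Eigen::Matrix<float, 32, 1> embedding = Eigen::Matrix<float, 32, 1>::Zero();
    uint32_t observations = 0;
};

using VoxelKey = uint64_t;  // e.g., integer voxel coordinates packed into one key

// Fuse one pixel's dense embedding into the voxel it projects to,
// as an incremental mean: new frames refine the feature, never overwrite it.
void fuseEmbedding(std::unordered_map<VoxelKey, VoxelFeature>& map,
                   VoxelKey voxel,
                   const Eigen::Matrix<float, 32, 1>& pixelEmbedding)
{
    VoxelFeature& f = map[voxel];
    ++f.observations;
    f.embedding += (pixelEmbedding - f.embedding) / float(f.observations);
}
```

A lightweight decoder would then map `embedding` to labels, colors, or whatever output format is needed, on demand.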
Hi, I previously considered adding NeRF or 3DGS for 3D reconstruction, but the data they generate seems less suitable for robotic applications, although they might be useful in other ways. For example, returning to a visited location from the opposite direction has always been a problem for appearance-based loop closure detection; if we could get a good rendering from the opposite perspective, we might be able to handle this situation. In theory, there is already a way to use data from RTAB-Map for NeRF or 3DGS, but I haven't had time to try it yet:
RTAB-Map -> AliceVision -> COLMAP -> NeRF/3DGS
For now, I am more interested in trying solutions based on OpenVDB first, such as VDBFusion, because its data structure is indeed better suited to the very large scenes required by SLAM. It's superior to OctoMap, but lacks attention in the robotics field. Currently known robotics tools using OpenVDB include the Spatio-Temporal Voxel Layer developed by Steve Macenski, and mapit (a project with the same name as one of your apps). When using STVL before, I was really impressed by its high performance. I took a quick look at the VDBFusion code and the API doesn't look complicated. They also contributed some patches to OpenVDB, but those all involve the Python wrapper. If we only need the C++ API, we can try using the OpenVDB packages distributed by Ubuntu directly.
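From that quick look, the core usage seems to boil down to something like this (signatures paraphrased from memory of the headers, so treat this as a sketch rather than a reference):

```cpp
#include <vdbfusion/VDBVolume.h>
#include <Eigen/Core>
#include <vector>

int main()
{
    // voxel size [m], truncation distance [m], space carving on/off
    vdbfusion::VDBVolume volume(0.05f, 0.3f, /*space_carving=*/false);

    std::vector<Eigen::Vector3d> points;    // one scan, already in the world frame
    Eigen::Vector3d origin(0.0, 0.0, 0.0);  // sensor position for that scan

    // Integration is add-only today, which is the limitation discussed above.
    volume.Integrate(points, origin,
                     [](float /*sdf*/) { return 1.0f; });  // constant weighting

    auto [vertices, triangles] = volume.ExtractTriangleMesh();
    return 0;
}
```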