introlab / rtabmap_ros

RTAB-Map's ROS package.
http://wiki.ros.org/rtabmap_ros

Improving DepthAI integration #1045

Open Serafadam opened 8 months ago

Serafadam commented 8 months ago

Hi, one of the developers from Luxonis here :slightly_smiling_face: We'd be happy to discuss what improvements could be made regarding running RTABMap together with DepthAI-based devices to provide a better out-of-the-box experience, since we've noticed that many users are choosing this approach.

Currently, there are two different options for running RTABMap with OAK: one is available in the core RTABMap library and uses the OAK in stereo mode, but we also have an example launch file inside depthai-ros that's currently set up to use RGBD mode, with some basic parameters. Some initial thoughts to start off:

  1. Recently we've added an option to publish the features detected on the cameras in ROS, and we were wondering whether, and by how much, performance would improve if those features were used by the visual odometry nodes (and how to integrate that).
  2. Regarding the VIO setup, I wanted to ask about the differences between the Stereo and RGBD options: are there any benchmarks that show the differences between the two, and which one would be the better default option? I guess the general answer is "it depends", but it would be nice to pinpoint the differences. Additionally, the current depthai_ros_driver can be reconfigured at runtime to provide different types of streams, change the resolution, or adjust other sensor parameters; I was wondering if that's something that could be useful in the mapping scenario.

I would also appreciate comments on the current state of the integration and what could be further improved. Also, do you have access to OAK devices for testing?

matlabbe commented 8 months ago

Hi @Serafadam ,

  1. Yes, this is something we want to add to rtabmap_ros; it could be useful to offload the feature extraction computation to the cameras.
  2. In general, VIO can work with only one camera and an IMU (like on phones). Adding another camera (stereo) can help to better estimate the scale of the tracked features. The cameras can be either monochrome or RGB; that doesn't really matter. What matters is to get the largest FOV, a global shutter, hardware sync between the shutters of the cameras (in the stereo case), and time synchronization between image stamps and IMU stamps (in the inertial visual odometry case). If a TOF camera is used to get the depth (in the RGB-D case), the problem is that it is often not synchronized with the RGB camera, but if VIO is computed from RGB+IMU without TOF, we can interpolate the pose of the TOF camera to correctly reproject the point cloud in the RGB frame (similar to what we did with Google Tango; see the sketch after this list). Using a TOF camera can however generate more accurate point clouds (e.g., Kinect Azure, Google Tango, or iPhone/iPad LiDAR), though their range is limited, particularly outdoors in sunlight. If depth is computed from stereo cameras and another RGB camera is used (using stereo just to get depth, and estimating VO with RGB-D), ideally the three cameras would need to be hardware synchronized to estimate the trajectory as well as with a stereo camera. I guess in this setup the advantage is that we could use the IR projector to improve depth estimation with the IR stereo camera, while still being able to track features using the RGB camera (which won't see the static IR pattern).
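
To illustrate the TOF pose interpolation mentioned in point 2, here is a minimal sketch (my own illustration, not rtabmap code), assuming the VIO poses are available as timestamped position + quaternion pairs and using SciPy:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(t, t0, p0, q0, t1, p1, q1):
    """Interpolate a camera pose at time t between two VIO poses.

    p0/p1 are 3D positions, q0/q1 are quaternions in (x, y, z, w) order,
    with t0 <= t <= t1.
    """
    alpha = (t - t0) / (t1 - t0)
    position = (1.0 - alpha) * np.asarray(p0) + alpha * np.asarray(p1)
    slerp = Slerp([t0, t1], Rotation.from_quat([q0, q1]))
    quaternion = slerp([t]).as_quat()[0]
    return position, quaternion

# Example: a TOF frame captured between two RGB+IMU odometry updates.
p, q = interpolate_pose(
    t=0.125,
    t0=0.10, p0=[0.0, 0.0, 0.0], q0=[0.0, 0.0, 0.0, 1.0],
    t1=0.15, p1=[0.05, 0.0, 0.0], q1=[0.0, 0.0, 0.0087, 0.9999],
)
```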

Current state of depthai/rtabmap integration

Standalone library

With rtabmap standalone, by default, the left and right IR/monochrome cameras are used. We chose not to add an option for the RGB camera because its FOV is smaller, it uses a rolling shutter, and its frames are not hardware-synchronized with the monochrome cameras. This is based on the original OAK-D camera (the only one I own). If you have newer cameras with a global-shutter RGB sensor, a larger FOV, and synchronization with the left/right monochrome cameras, we could look at adding more options to use the RGB stream.

Nevertheless, the currently exposed options can be seen in this panel:

[image: DepthAI source options panel in the RTAB-Map standalone GUI]

ROS

It could be interesting to make a ROS example using features extracted on the camera, but rtabmap_ros is lacking the interface to feed them to the rtabmap library.

Serafadam commented 8 months ago

Thanks for the response! I would also like to thank @borongyuan for the current progress on the bare C++ side of the integration; I would also appreciate comments regarding that part.

rtabmap_ros is lacking the interface to feed them to the rtabmap library.

What would be the steps needed to add that to the package?

borongyuan commented 8 months ago

Hi @Serafadam,

I just noticed this issue yesterday. I have been updating rtabmap standalone and haven't added anything to rtabmap_ros yet. I first tried to make some improvements to the rtabmap config in https://github.com/luxonis/depthai-ros/pull/292. At that time I also noticed that there are some implementations in depthai-ros that need to be adjusted: https://github.com/luxonis/depthai-ros/pull/269, https://github.com/luxonis/depthai-ros/pull/277, https://github.com/luxonis/depthai-ros/pull/279, https://github.com/luxonis/depthai-ros/pull/325, https://github.com/luxonis/depthai-ros/pull/346. But this can actually make things a little more complicated: depthai-ros is not only developed for SLAM applications, it needs to include support for various robotic applications. I know you've been working on adding more features to it and even refactoring some of the code. Running the system through ROS also introduces further potential issues, such as communication and clock problems.

So I decided to develop based on depthai-core and rtabmap standalone first. This makes the whole thing more controllable, and I can try many advanced tunings in depthai-core to optimize for SLAM; many of them are not wrapped in depthai-ros. We can add the verified parts to depthai-ros later. Another benefit of doing this is to provide an out-of-the-box experience for users of RTAB-Map and OAK: even if they have no ROS-related experience, they can get started and try SLAM directly.

Speaking of the out-of-the-box experience, I think an important reason that has hindered the widespread application of VSLAM is that it sets a high barrier for users. For most open source VSLAM frameworks, users generally need to write their own set of configurations. These mainly include sensor intrinsics and extrinsics, as well as accuracy indicators (generally Allan variance). The configuration is different for each individual device, which is only suitable for research and not for applications. Almost all camera drivers in RTAB-Map will try to parse the built-in params, which is more convenient to use. So when I updated CameraDepthAI.cpp, I carefully checked and modified this part first. Although rtabmap standalone does not depend on ROS, it basically follows the conventions of ROS and OpenCV. For future ROS use, as long as our packages follow conventions such as REP-104 and REP-145, there should be no problems.

I browsed your factory calibration package and I believe the factory calibration params are reliable. It would be better if you could ship each device with a reprojection error statistical report similar to that in Kalibr. This will give us more confidence when using it. The missing params now are the IMU-to-camera extrinsics. I mentioned it before in this PR. This is required for VIO. Our current approach is to check the camera model first, and then set it to the value measured from CAD. It doesn't actually need to be calibrated when we already know it from the board design data. You could put the board design data in the device, so that we can read it directly through your API without writing such ugly code.

What really needs to be calibrated are the IMU intrinsics. But this part is a bit more complicated, and it's not required; we can discuss it later when we want to further improve the algorithm accuracy. For the Allan variance of the IMU, we can also preset it for the user. The IMU model is known, and we can increase the rated params by a factor of 10 as stated in the Kalibr wiki (see the sketch below). However, the random walk noise of the BMI270 is only provided under NDA; I don't know if you have this data. My current approach is to perform noise analysis on several devices we have, and then look at the average and distribution. The noise density we measured is consistent with that in the datasheet. For the BNO086, I haven't checked the datasheet yet. Since its accelerometer and gyroscope cannot output at the same frequency, we may need to interpolate first before doing the analysis. The BMI270 still has output rate issues, especially on PoE devices. Fortunately it can at least output at 100Hz, which is just enough for VIO.

Another small problem is that the IMU message of ROS and the IMU data of RTAB-Map only define covariance, not the Allan variance commonly used in VIO. I am not sure if we should propose updating the IMU message definition in a later ROS version. Currently the two packages imu_tools and allan_variance_ros are maintained by Martin Günther; maybe we should discuss this with him.
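
As a concrete illustration of the inflation step mentioned above, a minimal sketch using placeholder numbers (not real BMI270/BNO086 values; those have to come from the datasheet or from your own Allan variance analysis):

```python
# Kalibr-style continuous-time IMU noise parameters. The values below are
# placeholders only; replace them with datasheet or measured numbers.
rated = {
    "accelerometer_noise_density": 1.9e-3,   # [m/s^2/sqrt(Hz)]   (placeholder)
    "accelerometer_random_walk":   4.0e-4,   # [m/s^3/sqrt(Hz)]   (placeholder)
    "gyroscope_noise_density":     1.9e-4,   # [rad/s/sqrt(Hz)]   (placeholder)
    "gyroscope_random_walk":       2.5e-5,   # [rad/s^2/sqrt(Hz)] (placeholder)
}

# The Kalibr wiki suggests inflating the rated values (a factor of ~10) to
# account for unmodelled effects before using them for calibration/VIO.
inflation = 10.0
imu_params = {name: value * inflation for name, value in rated.items()}
imu_params["update_rate"] = 100.0  # the BMI270 can at least output at 100 Hz

for name, value in imu_params.items():
    print(f"{name}: {value:.6e}")
```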

Next, let's talk about some issues related to hardware synchronization and exposure. As @matlabbe said, for VSLAM applications we prefer a large FOV, a global shutter, and hardware synchronization. For a global shutter, the most reasonable configuration is to set the image timestamp to the middle of the exposure interval. I saw that you recently added this option to depthai-ros, as well as the alpha-scaling related code. For rolling shutters there are actually a few algorithms that can support them, but I still need to study the hardware synchronization part of your documentation to see if there is a better way to handle it. The color cameras of your main models basically use rolling shutters, so we currently do not support the RGB stream. Your Chinese partner gave me an OAK-D-PRO-W-DEV last month; he changed the main camera to the wide-FOV OV9782 as I wished, and the IMU is a BNO086. Perhaps you can also provide @matlabbe with more suitable hardware.

[photo: the OAK-D-PRO-W-DEV unit]

I actually have another question regarding hardware synchronization. Although the left and right cameras are triggered at the same time, their exposure appears to be controlled independently. I noticed that their timestamps are not exactly the same, and the timestamp of the depth image is sometimes the same as the left camera's and sometimes the same as the right camera's. So I am synchronizing data based on the sequence number, and setting the timestamps of all images to the timestamp of the depth image (see the sketch below). Can you provide further explanation of the timestamp rule of multi-input nodes such as StereoDepth?

Also, is only the StereoDepth node able to perform stereo rectification? What I mean is that if we only want to use stereo input, we only need the camera to provide rectified images without disparity calculation, which should save some resources for other functions. I currently choose stereo input when using the OAK-D Pro W PoE, because its disparity map cannot be compressed for transmission when subpixel is enabled. I hope you can add this feature request to the plan.

I also made a feature request about an IR dot projector On/Off mode before. You have now provided a way to implement this feature through scripts and added documentation. I'm interested in this, but it's a bit difficult to use, and it obviously makes synchronization more complicated. Maybe I should try to only calculate the depth when the dot projector is on, to achieve a higher frame rate, and then merge the two depth images from before and after the illumination LED is turned on. This would be equivalent to applying a short temporal filter as well?
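
A minimal sketch of the sequence-number based grouping described above (my own simplification, not the actual CameraDepthAI.cpp code), assuming the frames are dai.ImgFrame messages coming from the left, right and depth output queues:

```python
import depthai as dai  # frames below are dai.ImgFrame objects from device output queues

# Buffer frames per stream and emit a synchronized triplet once the same
# sequence number has been seen on all three streams.
buffers = {"left": {}, "right": {}, "depth": {}}

def add_frame(stream, frame):
    seq = frame.getSequenceNum()
    buffers[stream][seq] = frame
    if all(seq in buffers[s] for s in buffers):
        left = buffers["left"].pop(seq)
        right = buffers["right"].pop(seq)
        depth = buffers["depth"].pop(seq)
        # Use the depth timestamp for all three images, as described above.
        stamp = depth.getTimestamp()
        return stamp, left, right, depth
    return None
```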

Regarding VIO: RTAB-Map did not have VIO at first. RTAB-Map initially only focused on appearance-based loop closure detection, which is the back-end part of SLAM. I remember that @matlabbe later implemented VO based on the tutorials by Davide Scaramuzza and Friedrich Fraundorfer? However, it seems a bit difficult to change RTAB-Map's own VO implementation into VIO, because many VIO algorithms are derived and modeled around the IMU, such as VINS-Mono and OpenVINS. There are also camera-centric algorithms, such as ORB-SLAM3. RTAB-Map also added support for other odometry approaches later.

I updated the OpenVINS integration for RTAB-Map recently. It works really well on OAK cameras. I chose to learn OpenVINS first because it provides a great MSCKF implementation. Its author, Professor Guoquan Huang, seems to have also participated in the Google Tango project before, and we know that RTAB-Map and Tango also have some relationship, so I think I can at least reproduce the same performance as before on the Yellowstone tablet. You may have seen the demo I recently recorded. OpenVINS not only has high accuracy, it's also very efficient. The test hardware I've been using a lot recently is a Jetson Orin Nano and a UMPC with an m3-8100Y processor; I don't need to lower the camera frame rate for this hardware at all. So this combination is likely to run on RAE as well. Our RAE users in China are currently experiencing some network problems; I'll test it later.

I also noticed that @saching13 is modifying Kimera. Is that part already available? Kimera uses the BSD license, which has fewer restrictions for applications. The KLT feature tracker provided by OAK is very suitable for MSCKF. I'm also developing an MSCKF implementation that uses the hardware acceleration. Apart from MSCKF, the only other efficient VIO algorithms seem to be the DSO series. But this requires us to perform photometric calibration and more strict exposure control, which would also limit the use of IR illumination. On the other hand, DSO can generate high-precision semi-dense point clouds, which gives us some inspiration for improving OAK's depth measurement.

saching13 commented 8 months ago

That's a great summary. Thanks @borongyuan

It would be better if you could ship each device with a reprojection error statistical report similar to that in Kalibr. This will give us more confidence when using it.

We have the reprojection errors of each device logged and stored internally. As we grow and have more resources, we can make them searchable by device ID through the portal or RobotHub.

The missing params now are the IMU-to-camera extrinsics. I mentioned it before in https://github.com/luxonis/depthai-ros/pull/347#issuecomment-1627612578. This is required for VIO. Our current approach is to check the camera model first, and then set it to the value measured from CAD

We are working on this. We are updating our board files to include this information for all the devices; devices calibrated from next month onward will include it.

The color cameras of your main models basically use rolling shutters, so we currently do not support the RGB stream.

Why not just use the stereo pair and ignore the rolling-shutter RGB in this case?

I noticed that their timestamps are not exactly the same, and the timestamp of the depth image is sometimes the same as the left camera's and sometimes the same as the right camera's. So I am synchronizing data based on the sequence number, and setting the timestamps of all images to the timestamp of the depth image. Can you provide further explanation of the timestamp rule of multi-input nodes such as StereoDepth?

@alex-luxonis Any idea on this?

Also, is only the StereoDepth node able to perform stereo rectification? What I mean is that if we only want to use stereo input, we only need the camera to provide rectified images without disparity calculation,

For that you can use the Warp node here, or you can use the generic Camera node, which performs the undistortion itself.

I also noticed that @saching13 is modifying Kimera. Is that part already available? Kimera uses the BSD license, which has fewer restrictions for applications.

We had to shut down the Kimera work due to some issues and pivoted to building on top of svo. It is in a beta version at the moment. I would be happy to add you so you can try it out and provide feedback and suggestions for refinement.

I'm also developing an MSCKF implementation that uses the hardware acceleration.

You mean hardware acceleration on the OAK? I will go over MSCKF and see if we can use it as we start moving parts of svo_pro into fw.

borongyuan commented 8 months ago

For that you can use the Warp node here

The Warp node should be able to do this, but we need to prepare the mesh ourselves using cv::initUndistortRectifyMap().
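
For reference, a sketch of the kind of mesh preparation meant here (placeholder intrinsics/extrinsics; on an OAK these would come from the device calibration):

```python
import cv2
import numpy as np

size = (640, 400)                                     # (width, height)
K_l = np.array([[450.0, 0.0, 320.0],
                [0.0, 450.0, 200.0],
                [0.0, 0.0, 1.0]])                     # left intrinsics (placeholder)
K_r = K_l.copy()                                      # right intrinsics (placeholder)
D_l = np.zeros(5)                                     # left distortion (placeholder)
D_r = np.zeros(5)                                     # right distortion (placeholder)
R = np.eye(3)                                         # left-to-right rotation (placeholder)
T = np.array([-0.075, 0.0, 0.0])                      # ~7.5 cm baseline (placeholder)

R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K_l, D_l, K_r, D_r, size, R, T)
map_lx, map_ly = cv2.initUndistortRectifyMap(K_l, D_l, R1, P1, size, cv2.CV_32FC1)
map_rx, map_ry = cv2.initUndistortRectifyMap(K_r, D_r, R2, P2, size, cv2.CV_32FC1)

# The dense maps can then be subsampled into the sparse mesh format expected by
# the on-device warp (e.g. one control point every 16 pixels).
mesh_left = np.dstack([map_lx, map_ly])[::16, ::16]
mesh_right = np.dstack([map_rx, map_ry])[::16, ::16]
```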

You mean on OAK hardware acceleration ?

The tracker part of MSCKF only needs to add RANSAC after KLT tracking, for example the LMedS used by OpenVINS or the 2-point RANSAC used by msckf_vio. Is it possible to add RANSAC into fw? For the left-to-right feature matching required in stereo mode, we can simply subtract the disparity from the feature positions in the left cam to get the feature positions in the right cam (see the sketch below); the disparity is also calculated by OAK, although this is slightly less accurate than direct tracking. The subsequent EKF part may not be easy to run on OAK, but we would have already offloaded a lot of the computation.
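
A small sketch of that disparity subtraction (my own illustration; a real implementation would also handle subpixel disparity scaling and invalid pixels more carefully):

```python
import numpy as np

def left_to_right_features(left_pts, disparity):
    """Transfer tracked feature positions from the rectified left image to the
    rectified right image using the dense disparity map.

    left_pts: (N, 2) array of (u, v) pixel coordinates in the left image.
    disparity: disparity map aligned with the left image, in pixels.
    """
    left_pts = np.asarray(left_pts, dtype=np.float32)
    u = np.clip(np.round(left_pts[:, 0]).astype(int), 0, disparity.shape[1] - 1)
    v = np.clip(np.round(left_pts[:, 1]).astype(int), 0, disparity.shape[0] - 1)
    d = disparity[v, u].astype(np.float32)
    right_pts = left_pts.copy()
    right_pts[:, 0] -= d      # same row after rectification, shifted by the disparity
    valid = d > 0             # a disparity of 0 usually means "no measurement"
    return right_pts, valid
```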

saching13 commented 8 months ago

The Warp node should be able to do this, but we need to prepare the mesh ourselves using cv::initUndistortRectifyMap().

Yes, but that is not needed if you use the generic Camera node.

saching13 commented 8 months ago

As for RANSAC, we need to see how computationally expensive it would be and how many SHAVEs are available.

we can simply subtract the disparity from the feature positions in the left cam to get the feature positions in the right cam.

When we do feature tracking on both left and right, it would need an additional step of computing a descriptor to match them. Is that what you mean here? What we have been thinking of is doing feature tracking on the right-rectified image; with depth aligned we get the 3D points and do RANSAC-based pose estimation using Arun's method / the essential matrix (we need to take depth-map noise into consideration), and then an EKF on the pose estimates.
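
For reference, a minimal sketch of Arun's SVD-based alignment that such a RANSAC loop would call on each sampled point set; this is a generic textbook implementation, not Luxonis code:

```python
import numpy as np

def arun_rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) such that R @ p + t ~= q.

    P, Q: (N, 3) arrays of corresponding 3D points, e.g. tracked features
    back-projected with the aligned depth in two consecutive frames.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                                    # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflections
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t
```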

The subsequent EKF part may not be easy to run in OAK, but we have already offloaded a lot of computations.

Curious to know why you think the EKF would be harder than RANSAC. I see RANSAC being more computationally expensive than an EKF over the states, unless the EKF includes landmark/map optimization too.

borongyuan commented 8 months ago

When we do feature tracking on both left and right, it would need an additional step of computing a descriptor to match them. Is that what you mean here?

What I mean is to perform feature tracking in only one camera, and then find the location of the features in the other camera. Both msckf_vio and OpenVINS use KLT for left-to-right sparse matching. But OAK has already done dense matching when calculating the disparity, so this step can be omitted. OpenVINS also has trackers based on the ORB descriptor and ArUco markers, but they are rarely used. This part of the comments in msckf_vio explains the steps clearly.

What we have been thinking of is doing feature tracking on the right-rectified image; with depth aligned we get the 3D points and do RANSAC-based pose estimation using Arun's method / the essential matrix (we need to take depth-map noise into consideration), and then an EKF on the pose estimates.

This seems like an easy solution. Does it use a loosely-coupled strategy? It may be more appropriate to use the inverse depth parametrization for feature noise processing.

Curious to know why you think EKF would be harder than RANSAC. I see ransac being more computationally expensive than EKF over states. Unless EKF includes landmark/map optimization too.

The basic EKF is easy to implement, but MSCKF is more complicated. OpenVINS, in particular, integrates various improvements to MSCKF. What it implements is actually hybrid SLAM, a mixture of EKF-VIO and MSCKF: some features are used for stochastic cloning, and some long-term tracked features are added to the state. Some optimizations are involved in the feature refinement and dynamic initialization parts. In addition, the accuracy of MSCKF largely comes from observability analysis and null-space maintenance techniques, such as First Estimate Jacobians and the Observability-Constrained EKF. These computations are not necessarily large, but I don't know whether it is possible to put so many modules on the RVC.

matlabbe commented 8 months ago

Thanks @borongyuan for all this information, pretty useful! I confirm RTAB-Map doesn't have a real VIO approach by default (it is more a VO approach to which I afterwards added IMU constraints, mainly to align the VO with gravity and for better feature tracking); it relies on third-party VIO approaches like OpenVINS for that.

@Serafadam

rtabmap_ros is lacking the interface to feed them to the rtabmap library.

What would be the steps needed to add that to the package?

Actually, the rtabmap node has a way to input features (keypoints with/without descriptors), by filling these fields in the RGBDImage msg:

https://github.com/introlab/rtabmap_ros/blob/61bd84bcd6062abef1f2c0596ecc564704ceae16/rtabmap_msgs/msg/RGBDImage.msg#L18-L23

One example I had is that rgbd_odometry/stereo_odometry can output an RGBDImage topic with the features extracted by VO, which can be fed to the rtabmap node for mapping/loop closure detection.
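
For illustration, a rough (untested) ROS 2 Python sketch of an external node filling those fields; the field names (key_points, points, descriptors) and the KeyPoint layout are read from the message definition linked above, so double-check them against your rtabmap_msgs version:

```python
from rclpy.node import Node
from rtabmap_msgs.msg import RGBDImage, KeyPoint

class FeatureRGBDPublisher(Node):
    """Publish an RGBDImage that already contains externally extracted features."""

    def __init__(self):
        super().__init__('feature_rgbd_publisher')
        self.pub = self.create_publisher(RGBDImage, 'rgbd_image_with_features', 10)

    def publish_frame(self, rgb, depth, rgb_info, depth_info, keypoints, descriptors_bytes):
        msg = RGBDImage()
        msg.header = rgb.header              # same stamp as the frame used for extraction
        msg.rgb = rgb
        msg.depth = depth
        msg.rgb_camera_info = rgb_info
        msg.depth_camera_info = depth_info
        for u, v in keypoints:               # 2D keypoints from the camera/front-end
            kp = KeyPoint()
            kp.pt.x = float(u)
            kp.pt.y = float(v)
            msg.key_points.append(kp)
        # Descriptors are a compressed cv::Mat serialized to bytes (see the
        # compression discussion further down in this thread).
        msg.descriptors = list(descriptors_bytes)
        self.pub.publish(msg)
```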

For rgbd_odometry, it currently cannot use input features; it will always re-extract them. However, forwarding the features from the input topic to the odometry estimation could be easily implemented here: https://github.com/introlab/rtabmap_ros/blob/61bd84bcd6062abef1f2c0596ecc564704ceae16/rtabmap_odom/src/nodelets/rgbd_odometry.cpp#L576-L589

The ideal output from depthai-ros would be a feature topic (a list of 2D keypoints with their descriptors) with the same stamp as the image frame used to compute the features. I could then make a node to convert/sync your topic format with camera_info/images to generate a rtabmap_msgs/RGBDImage that could be used by the rgbd_odometry and rtabmap nodes. The minimum data required in rtabmap_msgs/RGBDImage to enable VO and loop closure detection are:

saching13 commented 8 months ago

@matlabbe @borongyuan Do you have a recommendation on IMUs? We are looking for a better one for the upcoming product.

borongyuan commented 7 months ago

Do you have a recommendation on IMUs? We are looking for a better one for the upcoming product.

I'm not sure what aspects you mean by "better". Regarding the accuracy indicators and selection of MEMS IMUs, I have read the experiences shared by others before. Although there are many aspects to consider, as technology has advanced in recent years, devices from various brands are getting better and better, and some indicators may no longer be the shortcomings of the system. I will have some exchanges with the guys from Arducam and OAKChina tomorrow; I can discuss this with them first. I think we can organize a more detailed discussion about new hardware later.

ryanESX commented 6 months ago

@matlabbe Sorry for hijacking this thread, but I have a question about your reply (https://github.com/introlab/rtabmap_ros/issues/1045#issuecomment-1763488003). It is possible to send extracted features from another ROS node if we are using rgbd_odometry/stereo_odometry. Is it possible to send extracted features from another ROS node without rgbd_odometry/stereo_odometry, e.g. using a D435 + T265 or a D435 + open_vins (ROS wrapper)? In this case, the odometry would come from an external source rather than from rgbd_odometry/stereo_odometry.

borongyuan commented 6 months ago

I opened a new project, the world's first roller coaster SLAM dataset: https://github.com/Factor-Robotics/Roller-Coaster-SLAM-Dataset The first sequence was recorded using an OAK-D Pro W. You're welcome to take on the challenge and contribute more data. Watch the video

borongyuan commented 6 months ago

It is possible to send extracted features from another ROS node if we are using rgbd_odometry/stereo_odometry. Is it possible to send extracted features from another ROS node without rgbd_odometry/stereo_odometry?

Yes, if you run RTAB-Map via ROS, its front-end and back-end can be completely separated.

matlabbe commented 6 months ago

To add to @borongyuan's answer: when the rtabmap node is subscribing to an RGBDImage topic, you can fill the feature fields to make rtabmap use those features (with Mem/UseOdometryFeatures=True) instead of re-extracting them on its side.

ryanESX commented 6 months ago

@matlabbe thanks for your advice. It works. But the descriptors should be compressed according to this function: https://github.com/introlab/rtabmap/blob/master/corelib/include/rtabmap/core/Compression.h#L8.3

matlabbe commented 6 months ago

We provide a Python API to compress and uncompress rtabmap data. We are using zlib under the hood, but this function will make sure the header of the matrix is correctly set in the output compressed binary: https://github.com/introlab/rtabmap_ros/blob/master/rtabmap_python/src/rtabmap_python/compression.py
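
For example, something along these lines could be used before filling RGBDImage.descriptors (the exact helper name in compression.py is assumed here, so check the module; reimplementing the matrix header by hand with plain zlib is not recommended):

```python
import numpy as np
# Module path follows rtabmap_python/src/rtabmap_python/compression.py; the
# helper name `compress` is assumed -- verify it against the file linked above.
from rtabmap_python.compression import compress

descriptors = np.zeros((200, 32), dtype=np.float32)  # e.g. 200 feature descriptors
compressed = compress(descriptors)                   # zlib stream with the cv::Mat header set
# `compressed` (bytes) is what goes into the RGBDImage.descriptors field.
```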

borongyuan commented 5 months ago

@saching13 @Serafadam I've added support for the OAK-D-PRO-W-DEV in RTAB-Map. After some evaluation, I feel that using a global-shutter color camera is unnecessary, because in any scenario using the monochrome stereo camera for SLAM is the better choice. We can use the color camera for data colorization only, but this configuration is somewhat beyond the current design of RTAB-Map, so I'm trying this approach in other projects. This way we can continue to use mainstream OAK cameras without customization. If the jelly effect of the color camera is obvious, I will consider a solution similar to Gyroflow for correction.
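
A sketch of the colorization idea, assuming placeholder color-camera intrinsics K_rgb and a left-to-color extrinsic T_rgb_from_left (on OAK both would come from the device calibration):

```python
import numpy as np

def colorize_points(points_left, rgb, K_rgb, T_rgb_from_left):
    """Sample colors for 3D points reconstructed in the left mono camera frame.

    points_left: (N, 3) points in the left camera frame.
    rgb: HxWx3 color image. K_rgb: 3x3 intrinsics. T_rgb_from_left: 4x4 extrinsics.
    """
    pts_h = np.hstack([points_left, np.ones((points_left.shape[0], 1))])
    pts_rgb = (T_rgb_from_left @ pts_h.T).T[:, :3]      # into the color camera frame
    uv = (K_rgb @ pts_rgb.T).T
    uv = uv[:, :2] / uv[:, 2:3]                         # perspective division
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, rgb.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, rgb.shape[0] - 1)
    valid = pts_rgb[:, 2] > 0                           # points in front of the color camera
    return rgb[v, u], valid
```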

I'm also interested in your new factory calibration process and hope you can share more details. In various VSLAM algorithms, researchers have noticed that mono VIO often has smaller errors than stereo VIO; however, stereo VIO has better robustness because it can avoid the scale drift caused by degenerate motion. The accuracy of stereo VIO is largely related to calibration. Since stereo calibration minimizes the reprojection error, the algorithm is more sensitive to points at close range; therefore, the distance measured for far points may be too large or too small. This also causes VIO to have a relatively fixed scale error on specific hardware. This paper has an in-depth analysis and discussion of these aspects. I'd like to know if your new calibration process also addresses these improvements.

Thanks

CenekAlbl commented 5 months ago

@borongyuan thanks for the update!

As for your question about calibration: we significantly improved the data that is collected for calibration and the calibration procedure itself in the factory. The accuracy of all models will be significantly improved and also much more stable across units. WRT the distant points: as you say, the accuracy of those will be worse than for close points; in a SLAM setting I would probably limit the distance of the points being used and minimized over.

For the jelly effect of RS color cameras: this effect can also be reduced by using a lower resolution. E.g., a 4K RGB RS camera will have a significant delay between the top and bottom lines, whereas when set to e.g. 1080 lines, the delay between the top and bottom lines will be smaller -> fewer RS deformations.
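
As a back-of-the-envelope illustration (the line readout time below is a placeholder, not a real sensor spec):

```python
# The top-to-bottom rolling-shutter delay scales with the number of rows read out,
# so halving the vertical resolution roughly halves the skew.
line_time_us = 10.0                 # time to read one row (hypothetical value)
for rows in (2160, 1080):           # 4K vs 1080p vertical resolution
    delay_ms = rows * line_time_us / 1000.0
    print(f"{rows} rows -> ~{delay_ms:.1f} ms top-to-bottom readout delay")
```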