luxonis / depthai-python

DepthAI Python Library
MIT License

[Request] Robust Multiple-Baseline Stereo matching for robust stereoscopy in repetitive-pattern environments (RMBS-stereo) #675

Open stephansturges opened 1 year ago

stephansturges commented 1 year ago

Start with Why?

When using stereo devices "in the wild" on human-made objects it is extremely common to encounter repetitive patterns on the objects that you want to retrieve depth from. Brick walls, cobblestone roads, roof shingles, tiling etc... these often constitute the majority of a scene in an urban environment.

Unfortunately, stereo matching as implemented in DepthAI is not particularly good at retrieving depth from these types of patterns, for reasons that are well documented in the history of development of stereoscopy algorithms. See here for more information: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.152&rep=rep1&type=pdf

This can be observed with the current Luxonis devices, such as in this scene where a CM4-PoE device is pointed down at cobblestones from a height of approximately 10 m. Notice that the depth estimation is inconclusive in the area that shows the most repetition. In the current state of the DepthAI library this cannot be solved by tuning parameters in the stereo matching cost function. (image: depth output from the cobblestone scene)

(See more examples here: https://discuss.luxonis.com/d/875-depth-parameters-configuration-testing-code/4 )

How can this be solved?

There are different approaches to the solution, but the most promising seems to be using multiple-baseline cameras. See this paper for one implementation and details: https://www.researchgate.net/publication/3916660_A_robust_stereo-matching_algorithm_using_multiple-baseline_cameras
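For illustration, the core idea of that multiple-baseline approach is to express the matching cost of each baseline pair over a common inverse-distance axis and sum them, which suppresses the periodic ambiguity that any single baseline suffers from. Below is a minimal NumPy sketch of that idea only; the function name, window size, and the crude integer-shift handling are simplifications for illustration and not anything from DepthAI:

```python
import numpy as np
from scipy.ndimage import uniform_filter  # box filter for the windowed SSD


def sssd_inverse_distance(ref, others, baselines, fx, inv_depths, win=5):
    """Sum of SSDs over inverse distance for a reference view and views taken
    at different baselines (all rectified, row-aligned, grayscale float32).

    baselines[i] is the baseline in meters between ref and others[i], fx the
    focal length in pixels, inv_depths the candidate inverse distances (1/m).
    Returns the per-pixel inverse distance that minimizes the summed SSD.
    """
    best_cost = np.full(ref.shape, np.inf, dtype=np.float32)
    best_inv = np.zeros(ref.shape, dtype=np.float32)

    for iz in inv_depths:
        total = np.zeros(ref.shape, dtype=np.float32)
        for img, base in zip(others, baselines):
            # disparity implied by this inverse distance for this baseline
            d = int(round(fx * base * iz))
            # crude integer shift; sign and border/wrap-around handling omitted
            shifted = np.roll(img, d, axis=1)
            total += uniform_filter((ref - shifted) ** 2, size=win)
        better = total < best_cost
        best_cost[better] = total[better]
        best_inv[better] = iz
    return best_inv
```

Because every baseline votes on the same inverse-distance axis, a repetitive texture that produces several equally good matches for one pair rarely produces them at the same inverse distance for all pairs.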

This has already been mentioned also in the context of retrieving additional information from the planned Long-Range device by @ynjiun in this thread on the hardware: https://github.com/luxonis/depthai-hardware/issues/247

In the context of DepthAI, the implementation would require using the RGB camera on OAK-D devices in a desaturated mode as a third mono source, and using it to calculate additional disparity maps against one or both of the mono sensors to refine the local variance value for each pixel in a sliding window.
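The current StereoDepth node does not accept a third input, so this is a rough sketch of the capture side only, with matching assumed to happen on the host; the stream names and ISP scaling are illustrative, and these frames are not hardware-synchronized (which becomes an issue discussed further down):

```python
import cv2
import depthai as dai

pipeline = dai.Pipeline()

monoLeft = pipeline.create(dai.node.MonoCamera)
monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
monoRight = pipeline.create(dai.node.MonoCamera)
monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)

rgb = pipeline.create(dai.node.ColorCamera)
rgb.setBoardSocket(dai.CameraBoardSocket.RGB)
rgb.setIspScale(2, 3)  # scale the 1080p ISP output to 1280x720, roughly matching the monos

# stream all three views to the host
for name, src in (("left", monoLeft.out), ("right", monoRight.out), ("rgb", rgb.isp)):
    xout = pipeline.create(dai.node.XLinkOut)
    xout.setStreamName(name)
    src.link(xout.input)

with dai.Device(pipeline) as device:
    queues = {n: device.getOutputQueue(n, maxSize=4, blocking=False) for n in ("left", "right", "rgb")}
    while True:
        left = queues["left"].get().getCvFrame()
        right = queues["right"].get().getCvFrame()
        # desaturate the color frame so it can act as a third "mono" view
        center = cv2.cvtColor(queues["rgb"].get().getCvFrame(), cv2.COLOR_BGR2GRAY)
        # multi-baseline matching on (left, center, right) would happen on the host here
        cv2.imshow("center (gray)", center)
        if cv2.waitKey(1) == ord("q"):
            break
```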

It would be very useful for the type of work I am doing to see an evolution to this type of stereo matching to solve the problem of repeating textures, and I'm sure it would benefit many in the community!

ynjiun commented 1 year ago

@stephansturges could you share your captured left/right images of this "repetitive texture" case? I would like to try my algorithm to see if it can alleviate the "blue" area (no depth). Thanks.

stephansturges commented 1 year ago

@stephansturges could you share your captured left/right images of this "repetitive texture" case? I would like to try my algorithm to see if it can alleviate the "blue" area (no depth). Thanks.

Thanks for offering to do the test! You can find example mono files of this location here: https://drive.google.com/drive/folders/14JB64ApZJRZm_Zf1rJe_kx52A7xSQEUp?usp=sharing

I moved the camera a few times during capture to provide different examples, but the main cobblestone area was always a "hole" in the depth map.

For reference this was the stereo output during capture of mono images from the depthai pipeline:

(screenshot: stereo depth output during capture, 2022-09-19)

ynjiun commented 1 year ago

@stephansturges Thank you for sharing the stereo images. One more thing we need is the camera to camera (stereo: S,K,D,R,T,S_rect,R_rect,P_rect matrix) calibration parameters from your unit. Thank you very much.

stephansturges commented 1 year ago

@stephansturges Thank you for sharing the stereo images. One more thing we need is the camera to camera (stereo: S,K,D,R,T,S_rect,R_rect,P_rect matrix) calibration parameters from your unit. Thank you very much.

Is there a specific way to retrieve these with the depthai package? Or should I perform an openCV calibration to get these params?

ynjiun commented 1 year ago

You could export them from your unit using the depthai package; see the example code here.
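For anyone landing on this thread, a minimal sketch along the lines of the calibration-reader example; the 1280x800 resolution passed to getCameraIntrinsics is just the capture resolution used here:

```python
import numpy as np
import depthai as dai

with dai.Device() as device:
    calib = device.readCalibration()

    # intrinsics (K) at the capture resolution
    K_left = np.array(calib.getCameraIntrinsics(dai.CameraBoardSocket.LEFT, 1280, 800))
    K_right = np.array(calib.getCameraIntrinsics(dai.CameraBoardSocket.RIGHT, 1280, 800))

    # distortion coefficients (D)
    D_left = np.array(calib.getDistortionCoefficients(dai.CameraBoardSocket.LEFT))
    D_right = np.array(calib.getDistortionCoefficients(dai.CameraBoardSocket.RIGHT))

    # extrinsics of the left camera with respect to the right camera (4x4 [R|T])
    T_left_right = np.array(calib.getCameraExtrinsics(dai.CameraBoardSocket.LEFT,
                                                      dai.CameraBoardSocket.RIGHT))

    # stereo rectification rotations
    R_left = np.array(calib.getStereoLeftRectificationRotation())
    R_right = np.array(calib.getStereoRightRectificationRotation())

    print(K_left, D_left, T_left_right, R_left, sep="\n\n")
```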

ynjiun commented 1 year ago

@stephansturges Hi, just to confirm: are the images you shared stereoDepth.rectifiedLeft/rectifiedRight, or left/right before rectification? Please advise. Thank you.

Erol444 commented 1 year ago

Thoughts @szabi-luxonis on the last approach (using color cam on OAK-D-* for another stereo pair)?

stephansturges commented 1 year ago

@stephansturges Hi, just to confirm: are the images you shared stereoDepth.rectifiedLeft/rectifiedRight, or left/right before rectification? Please advise. Thank you.

These images are not rectified, as far as I can remember. I will get back to you in 48h with a new image set and all of the parameters, thanks :)

ynjiun commented 1 year ago

Thoughts @szabi-luxonis on the last approach (using color cam on OAK-D-* for another stereo pair)?

theoretically "yes", but pratically there are several issues need to be resolved:

  1. Global shutter vs. rolling shutter: for stereo vision it is preferred to use global shutter cameras, and the center color cam is a rolling shutter.
  2. Synchronization: I am not sure whether the center color cam is hardware-synced with the stereo pair; if not, it might inject more disparity error than it reduces.
  3. Calibration and rectification: there is a need to perform 3-way stereo calibration: L-R, L-C, C-R calibration and rectification. The current OAK-D-* API stack may not support this requirement.

Well, I thought about this approach before, but later abandoned it because of the above three major obstacles. Perhaps you have ideas for solving the above 3 issues? Please share. Thanks.

SzabolcsGergely commented 1 year ago

1 and 2: True. IIRC there was some work done on IMX378-OV9282 stereo a long time ago, for a customer, but it didn't work out well; that's why there's no support for it, I assume.

3: L-R and C-R calibrations are enough, and those are already performed; from them, the extrinsics for L-C can be calculated.
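In other words, the missing left-to-color extrinsic can be composed from the two that are already calibrated. A small sketch, assuming E_AB denotes the 4x4 homogeneous transform that maps points from camera A's frame to camera B's frame (the same convention as the "W.R.T." matrices printed later in this thread):

```python
import numpy as np
import depthai as dai

with dai.Device() as device:
    calib = device.readCalibration()
    # E_AB: 4x4 homogeneous transform mapping points from camera A's frame to camera B's frame
    E_LR = np.array(calib.getCameraExtrinsics(dai.CameraBoardSocket.LEFT, dai.CameraBoardSocket.RIGHT))
    E_CR = np.array(calib.getCameraExtrinsics(dai.CameraBoardSocket.RGB, dai.CameraBoardSocket.RIGHT))
    # left -> color: go left -> right, then right -> color (the inverse of color -> right)
    E_LC = np.linalg.inv(E_CR) @ E_LR
    print(E_LC)
```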

stephansturges commented 1 year ago

You could export them from your unit using the depthai package; see the example code here.

These stereo images are NOT rectified. Please find the calibration parameters for this camera below:

RGB Camera Default intrinsics... 1280 x 800
[[816.10791015625, 0.0, 662.2203979492188], [0.0, 815.0753784179688, 396.26171875], [0.0, 0.0, 1.0]]

RGB Camera resized intrinsics... 3840 x 2160
[[2.44832373e+03 0.00000000e+00 1.98666113e+03]
 [0.00000000e+00 2.44522607e+03 1.06878516e+03]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00]]

RGB Camera resized intrinsics... 4056 x 3040
[[2.58604199e+03 0.00000000e+00 2.09841089e+03]
 [0.00000000e+00 2.58277026e+03 1.50815430e+03]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00]]

LEFT Camera Default intrinsics... 1280 x 800
[[804.4307250976562, 0.0, 645.8418579101562], [0.0, 805.9994506835938, 394.1195983886719], [0.0, 0.0, 1.0]]

LEFT Camera resized intrinsics... 1280 x 720
[[804.4307251    0.           645.84185791]
 [  0.           805.99945068 354.11959839]
 [  0.             0.           1.        ]]

RIGHT Camera resized intrinsics... 1280 x 720
[[793.31036377   0.           649.45861816]
 [  0.           794.13549805 366.80813599]
 [  0.             0.           1.        ]]

LEFT Distortion Coefficients...
k1: -9.06633186340332, k2: 64.01287078857422, p1: 0.00037014155532233417, p2: 0.004507572390139103,
k3: -90.78865814208984, k4: -9.145102500915527, k5: 64.22074127197266, k6: -90.83280181884766,
s1: 0.0, s2: 0.0, s3: 0.0, s4: 0.0, τx: 0.0, τy: 0.0

RIGHT Distortion Coefficients...
k1: -3.764836549758911, k2: 57.089759826660156, p1: -0.0009946062928065658, p2: 0.0036667559761554003,
k3: -43.92017364501953, k4: -3.887897253036499, k5: 57.23114013671875, k6: -43.462310791015625,
s1: 0.0, s2: 0.0, s3: 0.0, s4: 0.0, τx: 0.0, τy: 0.0

RGB FOV 68.7938003540039, Mono FOV 71.86000061035156

LEFT Camera stereo rectification matrix...
[[ 9.72267767e-01  3.32415744e-03  3.37760016e+01]
 [-1.11641311e-02  9.85241860e-01  2.50823365e+01]
 [-2.11857628e-05 -8.93426728e-08  1.01356903e+00]]

RIGHT Camera stereo rectification matrix...
[[ 9.85896693e-01  3.37381855e-03  2.13477234e+01]
 [-1.13206261e-02  9.99960838e-01  7.32403233e+00]
 [-2.14827378e-05 -9.06774038e-08  1.01384015e+00]]

Transformation matrix of where left Camera is W.R.T. right Camera's optical center
[[ 9.99970555e-01  4.66156052e-03  6.09558960e-03 -9.00829983e+00]
 [-4.66190279e-03  9.99989152e-01  4.19921271e-05  1.10289901e-02]
 [-6.09532790e-03 -7.04079430e-05  9.99981403e-01 -9.86258015e-02]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00]]

Transformation matrix of where left Camera is W.R.T. RGB Camera's optical center
[[ 9.99942183e-01  1.07287923e-02  6.62428502e-04 -7.58986807e+00]
 [-1.07335718e-02  9.99912858e-01  7.69035192e-03 -6.37143180e-02]
 [-5.79862972e-04 -7.69701786e-03  9.99970138e-01 -1.00172304e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00]]

ynjiun commented 1 year ago

@stephansturges

Great! If possible, could you capture another set of stereo rectified images (in 1280x720 resolution)? Preferably with similar repetitive texture in the scene. Thanks.
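For reference, one way to capture such rectified pairs straight from the device is via the StereoDepth node's rectifiedLeft/rectifiedRight outputs; a minimal sketch, where the stream names and the number of saved frames are arbitrary:

```python
import cv2
import depthai as dai

pipeline = dai.Pipeline()
monoLeft = pipeline.create(dai.node.MonoCamera)
monoRight = pipeline.create(dai.node.MonoCamera)
stereo = pipeline.create(dai.node.StereoDepth)

monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_800_P)
monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_800_P)

monoLeft.out.link(stereo.left)
monoRight.out.link(stereo.right)

# stream the rectified outputs to the host
for name, src in (("rectified_left", stereo.rectifiedLeft), ("rectified_right", stereo.rectifiedRight)):
    xout = pipeline.create(dai.node.XLinkOut)
    xout.setStreamName(name)
    src.link(xout.input)

with dai.Device(pipeline) as device:
    qL = device.getOutputQueue("rectified_left", maxSize=4, blocking=False)
    qR = device.getOutputQueue("rectified_right", maxSize=4, blocking=False)
    for i in range(20):  # save a handful of rectified pairs
        cv2.imwrite(f"rectifiedLeft_{i}.png", qL.get().getCvFrame())
        cv2.imwrite(f"rectifiedRight_{i}.png", qR.get().getCvFrame())
```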

stephansturges commented 1 year ago

@stephansturges

Great! If possible, could you capture another set of stereo rectified images (in 1280x720 resolution)? Preferably with similar repetitive texture in the scene. Thanks.

I've updated the folder with a new data collection, you can find the files here (collection2.zip) https://drive.google.com/drive/folders/14JB64ApZJRZm_Zf1rJe_kx52A7xSQEUp?usp=sharing

FYI, I actually shot these in 1280x800 because this is the native resolution of the sensors on this device. This is the CM4 PoE device with the global shutter RGB unit (OV9782 instead of the standard RGB camera).

Unfortunately it also looks from this data collection like my RGB camera is dirty or out of focus, but I'm not in the office where the camera is at the moment so I can't correct this (I'm running everything over SSH)! I will try to fix this tomorrow and report back with a new collection.

ynjiun commented 1 year ago

@stephansturges

Thanks for the rectified image collection2. After cropping out the rectified border, the resulting image resolution used for the test run is 1216x720. Attached below please find the cropped rectifiedLeft_60 image and the predicted distance and disparity maps (image: CM4_PoE).
Distance map color code: darker is closer, brighter is farther.
Disparity map color code: darker for smaller disparity, brighter for larger disparity.
For the actual values of predicted distance and disparity, you can download the disparity and distance files in .npy format: rectified_left_disparity.zip rectified_left_distance.zip

stephansturges commented 1 year ago

@ynjiun

Thanks for running this test! I'm having a hard time understanding the format and scale of the .npy files, but from what I can see it looks like you have continuous depth all across the frame with no holes. (image: my visualization; the color mapping looks bad because I suspect I'm not using the correct scale for the values.) Is your algorithm doing anything to fill gaps here? If not, this is already a lot better than the output I have from the standard DepthAI algorithm.

ynjiun commented 1 year ago

@stephansturges

The pixel value in disparity.npy is the actual predicted disparity of that pixel with reference to the left image. The distance is calculated from the disparity value as: distance = 70 meters / disparity. To see the distance more vividly, I would recommend scaling the min-max range to 0-255 and displaying it as a grayscale image.
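A small sketch of that conversion and visualization; the .npy file name is assumed from the shared archive, and the 70 m constant is the one quoted above:

```python
import cv2
import numpy as np

disparity = np.load("rectified_left_disparity.npy").astype(np.float32)

# distance = 70 m / disparity, guarding against zero-disparity pixels
distance = np.where(disparity > 0, 70.0 / np.maximum(disparity, 1e-6), 0.0)


def to_gray(img):
    """Min-max scale to 0-255 for display as a grayscale image."""
    lo, hi = float(img.min()), float(img.max())
    return ((img - lo) / max(hi - lo, 1e-6) * 255.0).astype(np.uint8)


cv2.imshow("disparity", to_gray(disparity))
cv2.imshow("distance", to_gray(distance))
cv2.waitKey(0)
```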

The algorithm actually uses a transformer to match features extracted from a deep learning model, thus no "gaps" appear. By the way, what kind of application are you developing? Or, in other words, what kind of "distance accuracy" or other requirements are you looking for?

stephansturges commented 1 year ago

@ynjiun Thanks for the explanation of the output, I will set up a better visualization.

The algorithm actually uses a transformer to match features extracted from a deep learning model, thus no "gaps" appear. By the way, what kind of application are you developing? Or, in other words, what kind of "distance accuracy" or other requirements are you looking for?

I am working with small quadcopter drones and other low-altitude UAVs, and I am using stereo depth as an additional sensing method alongside a neural network that is designed to detect ground-level obstacles. For this reason I am not interested in using AI enhancements for the depth estimation, because this sensing modality is meant to be kept as "deterministic" as possible while the AI component works on RGB data, and may be enhanced with RGB+D in the future :) You can find the AI component as a solo FOSS project here: https://github.com/stephansturges/OpenLander

Your method does seem to give great results however!

stephansturges commented 1 year ago

@ynjiun is your approach based on https://github.com/mli0603/stereo-transformer ? I'd be curious to try it on an actual UAV..

stephansturges commented 1 year ago

@ynjiun

Interesting. So you use semantic segmentation for identifying a "safe landing zone"? Curious: how do you generate the ground truth? Manual labeling, or simulation?

All the data is synthetic, so no labeling required :)

As for the stereo method: if you're willing to share the code I'd be happy to test it!

stephansturges commented 1 year ago

@ynjiun Sure, my email address is my name @gmail .com ;)

stephansturges commented 1 year ago

Anecdotally, I get much better stereoscopy out of the CM4-PoE after recalibrating vs. the factory calibration.

(image: depth output after recalibration)

There is still a patch of "failed depth" in the cobblestones but much less than previously.