Neural-Assisted Disparity Depth Estimation

Luxonis-Brandon commented 3 years ago

Start with the `why`:

The why of this effort (and initial research) is that in many applications depth cameras (and even sometimes LIDAR) are not sufficient to successfully detect objects in varied conditions. Specifically, for Luxonis’ potential customers, this is directly limiting their business success:

Autonomous wheelchairs. The functionality above would be HUGE for this application, as existing solutions are struggling with the output of D435 depth. It gets tricked too easily and misses objects even w/ aggressive host-side filtering and other detection techniques.
Autonomous lawn mowing. This use-case is also struggling with object detection using D435. The system can't reliably identify soccer-ball-sized things even with significant host-side post-processing, and it needs to be able to identify things down to baseball size.
Volumetric estimation of low-visual-interest objects. Disparity depth struggles significantly with objects (particularly large objects) of low visual interest as it lacks features to match. Neural networks can leverage latent information from training that overcomes this limitation - allowing volumetric estimation where traditional algorithmic-based disparity-depth solutions cannot adequately perform.

DepthAI was not originally intended to solve this sort of problem, but it is well suited to solving it.

Background:

As of now, the core use of DepthAI is to run 2D Object Detectors (e.g. MobileNetSSDv2) and fuse them with stereo depth to be able to get real-time 3D position of objects that the neural network identifies. See here for it finding my son's XYZ position for example. This solution is not applicable to the above two customers because the type of object must be known to the neural network. Their needs are to avoid any object, not just known ones, and specifically objects which are hard to pick up, which are lost/missed by traditional stereo depth vision.
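As a rough illustration of that fusion step, the sketch below back-projects the center of a detector bounding box into camera-frame XYZ using the pinhole model and a depth value. The intrinsics, box, and depth are made-up numbers, and `backproject` / `bbox_xyz` are hypothetical helper names, not DepthAI API calls; the on-device implementation may differ (for example in how depth is sampled inside the box).

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at depth Z into camera-frame XYZ (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


def bbox_xyz(bbox, depth_m, intrinsics):
    """Use the bounding-box center as the pixel to back-project."""
    xmin, ymin, xmax, ymax = bbox
    u = (xmin + xmax) / 2.0
    v = (ymin + ymax) / 2.0
    return backproject(u, v, depth_m, *intrinsics)


# Hypothetical intrinsics (fx, fy, cx, cy) for a 1280x720 image; real values
# should come from the device calibration.
INTRINSICS = (860.0, 860.0, 640.0, 360.0)

# Hypothetical detection box (pixels) and a depth of 2.0 m at that location.
xyz = bbox_xyz((600, 300, 760, 500), 2.0, INTRINSICS)
print("XYZ in meters:", xyz)
```

Note that this only works for objects the detector already knows about, which is exactly the limitation described above.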

New Modality of Use

So one idea we had recently was to leverage the neural compute engines (and SHAVES) of the Myriad X to make better depth, so that difficult objects which traditional stereo depth misses could be detected using depth that’s improved by the neural network.

Implementing this capability, the capability to run neural inference to produce the depth map directly, or to improve the results of the disparity-produced depth map, is hugely enabling for the use-cases mentioned above, and likely many others.

Move to the `how`:

The majority of the work of how to make this happen will be in surveying what research has been done, and which techniques are sufficiently light-weight to run on DepthAI directly. Below is some initial research to that end:

Google Mannequin Challenge:

Blog explaining it: https://ai.googleblog.com/2019/05/moving-camera-moving-people-deep.html
Dataset: https://google.github.io/mannequinchallenge/www/index.html
Github: https://github.com/google/mannequinchallenge
Notice in a lot of cases this is actually quite good-looking depth just from a single camera. Imagine how amazing it could look with 2 or 3 cameras.

Could produce just insanely good depth maps.

KITTI DataSet:

http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo

So check this out. A whole bunch of ground truth data, with calibration pictures, etc. So this could be used to train a neural network for sure on this sort of processing.

And then there's a leaderboard down below of those who have.

PapersWithCode:

PapersWithCode is generally awesome. They have a slack even.

https://paperswithcode.com/task/stereo-depth-estimation

Others and Random Notes:

So have a dig through there. This one from there seems pretty neat: https://github.com/CVLAB-Unibo/Real-time-self-adaptive-deep-stereo

These guys seem like they're getting decent results too: https://arxiv.org/pdf/1803.09719v3.pdf

So on a lot of these it's a matter of figuring out which ones are light enough weight and so on to see about porting.

Notice this one uses KITTI dataset as well: https://www.cs.toronto.edu/~urtasun/publications/luo_etal_cvpr16.pdf

From Intel R&D directly, "Deep Learning Stereo Vision at the edge": https://arxiv.org/pdf/2001.04552.pdf Apparently this was never implemented.
Google’s StereoNet looks really fast/lightweight: https://arxiv.org/pdf/1807.08865.pdf
Github summarizing depth quality enhancements using CNNs: https://github.com/mdcnn/Depth-Image-Quality-Enhancement
This one looks pretty interesting: https://arxiv.org/pdf/1910.00541.pdf

SparseNN depth completion https://www.youtube.com/watch?v=rN6D3QmMNuU&feature=youtu.be

ROXANNE Consistent video depth estimation https://roxanneluo.github.io/Consistent-Video-Depth-Estimation/

https://web.stanford.edu/class/ee368/Project_Autumn_1516/Reports/Jordan_Shridhar.pdf Seems like the Myriad X's 2x NCE + SHAVES are plenty fast enough to produce a super-great disparity depth output in real time.
https://arxiv.org/pdf/1910.13708.pdf
DDRNet: Depth Map Denoising and Refinement for Consumer Depth Cameras Using Cascaded CNNs:
- http://openaccess.thecvf.com/content_ECCV_2018/papers/Shi_Yan_DDRNet_Depth_Map_ECCV_2018_paper.pdf
- https://github.com/neycyanshi/DDRNet
AMNet: Deep Atrous Multiscale Stereo Disparity Estimation Networks: https://arxiv.org/pdf/1904.09099.pdf
Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches: https://github.com/jzbontar/mc-cnn/blob/master/README.md
Siamese network. Probably way too big, as it shows multi-second run-times: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6472548/
The Middlebury stereo dataset seems incredibly useful: https://github.com/kelkelcheng/GC-Net-Tensorflow/blob/master/README.md
DispNetC shows 0.06 runtime, which is encouraging.
Real-time self-adaptive deep stereo
- https://zpascal.net/cvpr2019/Tonioni_Real-Time_Self-Adaptive_Deep_Stereo_CVPR_2019_paper.pdf
- https://github.com/CVLAB-Unibo/Real-time-self-adaptive-deep-stereo/blob/master/README.MD
Pytorch implementation of the several Deep Stereo Matching Network(DSMnet) https://github.com/wyf2017/DSMnet/blob/master/README.md

2emoore4 commented 3 years ago

As a DepthAI user, I want to emphasize the importance of having clean/accurate/precise depth maps - it's clear that deep learning is the key to achieving this.

It's definitely possible to clean up depth maps with more traditional filtering, with something like the Bilateral Solver: https://drive.google.com/file/d/1zFzCaFwkGK1EGmJ_KEqb-ZsRJhfUKN2S/view

However there has been much more work recently to apply deep learning to 3d image generation, and more work is coming all the time.

Stereo Magnification introduced Multi Plane Images, and used differentiable rendering to learn to generate them from stereo images: https://people.eecs.berkeley.edu/~tinghuiz/projects/mpi/

Many have extended on this idea, but much of the latest work uses dozens of input images, instead of just two:

DeepView: https://augmentedperception.github.io/deepview/
Immersive Light Field Video w/ Layered Meshes: https://augmentedperception.github.io/deepviewvideo/
Neural Radiance Fields: https://www.matthewtancik.com/nerf

(Not all of these output MPIs, but all are fairly similar)

There's also plenty of recent work around monocular depth estimation, like MiDaS from Intel: https://github.com/intel-isl/MiDaS

Some take existing 3d photos, and try to inpaint disocclusions, so that inaccuracies are less noticeable: https://shihmengli.github.io/3D-Photo-Inpainting/

Luxonis-Brandon commented 3 years ago

Thanks @2emoore4 ! Super appreciate it. Will review all these shortly. And also sharing with the team!

saching13 commented 3 years ago

I am adding the paper by Skydio which carries out end to end learning for stereo. https://arxiv.org/pdf/1703.04309.pdf

Luxonis-Brandon commented 3 years ago

Thanks!

Luxonis-Brandon commented 3 years ago

This looks quite interesting (Martin brought up internally): https://geometry.cs.ucl.ac.uk/projects/2018/depthcut/

Luxonis-Brandon commented 3 years ago

Check out the datasets referenced near the end of this paper: https://arxiv.org/pdf/1612.02401.pdf The approach is also interesting IMO, and could be adapted for deep learning from stereo. (they are solving a harder problem which is both motion and depth from a pair of images, but you could fix motion since it's known and just focus on the depth part).

Luxonis-Brandon commented 3 years ago

PatchmatchNet: Learned Multi-View Patchmatch Stereo Looks like an interesting paper for resource limited devices. https://github.com/FangjinhuaWang/PatchmatchNet https://arxiv.org/pdf/2012.01411v1.pdf

Luxonis-Brandon commented 3 years ago

Some additional resources from Discord:

https://www.hindawi.com/journals/cin/2020/8562323/

Luxonis-Brandon commented 2 years ago

https://antabangun.github.io/projects/CoEx/#dem

Luxonis-Brandon commented 2 years ago

https://github.com/ibaiGorordo/UnrealCV-stereo-depth-generation

tersekmatija commented 2 years ago

https://arxiv.org/pdf/2007.12140.pdf https://github.com/ibaiGorordo/HITNET-Stereo-Depth-estimation

dhruvmsheth commented 2 years ago

https://github.com/ibaiGorordo/HITNET-Stereo-Depth-estimation

This seems to be pretty accurate. Achieved results on TFlite HITNET Stereo Depth Estimation -

Compared to original results -

Luxonis-Brandon commented 2 years ago

Looks great - thanks for sharing!

Luxonis-Brandon commented 2 years ago

https://github.com/cogsys-tuebingen/mobilestereonet - From @PINTO0309 in Discord.

Luxonis-Brandon commented 2 years ago

The first results are starting to come. Here's MIT Fast Depth (https://github.com/dwofk/fast-depth) running on OAK-D-(anything): vb71SNUBj0

nickjrz commented 2 years ago

Hey @Luxonis-Brandon, this looks like a great starting point for neural network assisted depth estimation. I wonder how precise it could get if we added the depth ground truth in a self-supervised training. Is the inference running on the host? If so, what would it take to optimize the network to run onboard the OAK-D?

Luxonis-Brandon commented 2 years ago

This is running on OAK-D directly, not on the host. Matija will be making a pull request soon so you'll be able to try it. (He may have already and I missed it - unsure... he just got it working this weekend.)

nickjrz commented 2 years ago

I was able to run real-time inference on HITNET Stereo depth estimation (middlebury) using OAK-D and having the inference on the host. Here are my results:

output_ful

PINTO0309 commented 2 years ago

Due to a problem with OpenVINO's conversion to Myriad Blob, I submitted an issue to Intel's engineers (OpenVINO). So far, Intel engineers seem to be concerned that the structure of the model is wrong, but we are able to infer it successfully in ONNX runtime and TFLite runtime.

[Bug] GatherND shape conversion from ONNX is inaccurate #7379 (HITNET to blob / OpenVINO) https://github.com/openvinotoolkit/openvino/issues/7379

ibaiGorordo commented 2 years ago

Also, HITNET looks nice, but it is quite slow. Currently, monocular depth estimation models (fastnet, Midas 2.1 small...) seem to be faster than the stereo ones (current ones are too complex, with 3D convolutions and the cost aggregation). But I still have hope that there is a fast stereo model somewhere :monocle_face:

PINTO0309 commented 2 years ago

It looks like the issue I posted has been triaged and escalated to the development team. I expect the model will run faster when inference is done with OpenVINO, so I will be patient and keep following the issue.

Luxonis-Brandon commented 2 years ago

Awesome - thanks!

ghost commented 2 years ago

Can somebody submit algorithm results to the benchmark? https://vision.middlebury.edu/stereo/eval3/

gurbain commented 2 years ago

I was able to run real-time inference on TFLite HITNET Stereo depth estimation (middlebury) using OAK-D and having the inference on the host. Here are my results:

Hey,

Sorry for the spam but I am trying to reproduce the same example that you showed @nickjrz (stereo depth estimation on the host with an oak-d and hitnet) and I can't get as good results as you show. I actually started from the same project (https://github.com/ibaiGorordo/HITNET-Stereo-Depth-estimation) but it looks like my results are much worse than yours (maybe the pre-processing?). Could you maybe provide a link to your code, it would be really interesting. Thank you!

Luxonis-Brandon commented 2 years ago

@tersekmatija may be able to help advise too.

nickjrz commented 2 years ago

Hey @gurbain,

Some advice is to make sure you have the right parameters for the DepthAI stereo camera you are using such as baseline and focal length. You can also look at your input tensor and make sure it matches the input parameters of the model. I hope that helps!

PINTO0309 commented 2 years ago

https://github.com/ibaiGorordo/ONNX-HITNET-Stereo-Depth-estimation

https://github.com/ibaiGorordo/TFLite-HITNET-Stereo-depth-estimation

ibaiGorordo commented 2 years ago

First, make sure you get the correct disparity map by passing the rectified images to the model. For the disparity you should not need any other changes. If the disparity map does not look good, there might be a problem with the rectified images, and you might need to calibrate the board. Does the depthai depth map from the library demo look good?

For the depth, check the depthai documentation on how to get the depth from disparity: https://docs.luxonis.com/projects/api/en/latest/components/nodes/stereo_depth/#calculate-depth-using-dispairty-map
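For reference, the relationship described on that documentation page can be sketched as below: the focal length in pixels follows from the image width and horizontal field of view, and depth is focal length times baseline divided by disparity. The HFOV and baseline constants are illustrative values assumed here for an OAK-D and should be replaced by the numbers from your own device calibration; the function names are hypothetical.

```python
import math


def focal_length_px(image_width_px, hfov_deg):
    """Focal length in pixels from image width and horizontal field of view."""
    return image_width_px * 0.5 / math.tan(math.radians(hfov_deg) * 0.5)


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in meters for one disparity value; zero disparity maps to infinity."""
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Assumed example values; check them against your own calibration data.
HFOV_DEG = 71.9      # assumed horizontal FOV of the mono cameras
BASELINE_M = 0.075   # assumed stereo baseline (7.5 cm)

focal = focal_length_px(1280, HFOV_DEG)
for disparity in (95.0, 20.0, 5.0):
    depth = disparity_to_depth(disparity, focal, BASELINE_M)
    print(f"disparity {disparity:5.1f} px -> depth {depth:.2f} m")
```

One thing to watch when the disparity comes from a network rather than from the StereoDepth node: the disparity is expressed in pixels of the network's input resolution, so the focal length should be computed for that width (for example `focal_length_px(640, HFOV_DEG)` if the images were resized to 640 wide).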

PINTO0309 commented 2 years ago

A very lightweight stereo depth estimation model. The conversion to OpenVINO was successful, but I am struggling with Myriad Blob because it does not support ExtractImagePatches. If there is an equivalent built from standard operations, it may be possible to convert it. Any suggestions for replacing ExtractImagePatches with standard operations would be very welcome. The only workaround I can think of right now is to offload only ExtractImagePatches to the CPU and stitch the model processing together.

ONNX, TFLite, OpenVINO (2MB - 11MB) https://github.com/PINTO0309/PINTO_model_zoo/tree/main/202_stereoDNN
Original Repo https://github.com/NVIDIA-AI-IOT/redtail/tree/master/stereoDNN
Paper https://arxiv.org/pdf/1803.09719.pdf
ExtractImagePatches https://docs.openvino.ai/latest/openvino_docs_ops_movement_ExtractImagePatches_3.html https://www.programcreek.com/python/?CodeExample=extract+patches
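One possible way to express patch extraction with standard operations is as k*k strided slices concatenated along the channel axis. The pure-Python sketch below only illustrates that idea on nested lists (H x W x C), assuming VALID padding and a dilation rate of 1; it is not the replacement actually used for this model, the helper names are hypothetical, and whether the corresponding slice/concat ops compile for Myriad is not verified here.

```python
def strided_slice(img, dy, dx, out_h, out_w, stride):
    """Take every `stride`-th pixel starting at offset (dy, dx)."""
    return [
        [img[i * stride + dy][j * stride + dx] for j in range(out_w)]
        for i in range(out_h)
    ]


def concat_channels(tensors):
    """Concatenate same-sized H x W x C tensors along the channel axis."""
    out_h, out_w = len(tensors[0]), len(tensors[0][0])
    return [
        [[c for t in tensors for c in t[i][j]] for j in range(out_w)]
        for i in range(out_h)
    ]


def extract_patches(img, ksize, stride):
    """Patch extraction built only from strided slices and a concat."""
    h, w = len(img), len(img[0])
    out_h = (h - ksize) // stride + 1
    out_w = (w - ksize) // stride + 1
    slices = [
        strided_slice(img, dy, dx, out_h, out_w, stride)
        for dy in range(ksize)
        for dx in range(ksize)
    ]
    return concat_channels(slices)


def extract_patches_reference(img, ksize, stride):
    """Naive sliding-window version used only to check the result."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(0, h - ksize + 1, stride):
        row = []
        for j in range(0, w - ksize + 1, stride):
            patch = []
            for dy in range(ksize):
                for dx in range(ksize):
                    patch.extend(img[i + dy][j + dx])
            row.append(patch)
        out.append(row)
    return out


# 3x3 single-channel demo image holding the values 0..8.
demo = [[[0], [1], [2]], [[3], [4], [5]], [[6], [7], [8]]]
print(extract_patches(demo, ksize=2, stride=1)[0][0])  # top-left 2x2 patch
```

The number of slices grows with the square of the kernel size, so this trades one unsupported op for many small supported ones.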

Screenshot 2021-12-14 08:27:32

tersekmatija commented 2 years ago

Also, a snippet of an extract patches equivalent in TF: https://github.com/onnx/tensorflow-onnx/issues/436#issuecomment-993313423.

PINTO0309 commented 2 years ago

@tersekmatija Thank you. The output matched. Screenshot 2021-12-14 22:05:39

PINTO0309 commented 2 years ago

I have successfully replaced ExtractImagePatches with standard operations, but unfortunately I get an incomprehensible error when converting the subsequent Conv3D to Myriad Blob. The behavior of myriad_compile seems to be strange. :cry:

/home/jenkins/agent/workspace/private-ci/ie/build-linux-ubuntu20/b/repos/openvino/inference-engine/src/vpu/graph_transformer/src/frontend/frontend.cpp:439 Failed to compile layer "model/conv3d_8/Conv3D": [ GENERAL_ERROR ] 
/home/jenkins/agent/workspace/private-ci/ie/build-linux-ubuntu20/b/repos/openvino/inference-engine/src/vpu/graph_transformer/src/stages/convolution.cpp:404 number of biases must equal to number of output channels per group, but: channels per group=32, biases=1

PINTO0309 commented 2 years ago

I have confirmed that the problem below with errors occurring during HITNet conversion is resolved in OpenVINO 2022.1. In fact, I was able to convert to OpenVINO IR. https://github.com/luxonis/depthai/issues/173#issuecomment-918991375

HITNet OpenVINO IR FP16

However, when compiling to Myriad Blob, I encountered a new error, so I submitted a new issue again.

"[Bug] Const data got different desc and content byte sizes (24 and 96 respectively)" error when converting ConvolutionBackpropData using compile_tools #9517 https://github.com/openvinotoolkit/openvino/issues/9517

Luxonis-Brandon commented 2 years ago

Some other neat ones: https://cvlab-unibo.github.io/neural-disparity-refinement-web/ https://arxiv.org/abs/2110.15367

Luxonis-Brandon commented 2 years ago

We wrote an implementation of the paper above and also a training solution for it. Seems to be initially working and starting to train/converge OK-ish. Screenshot from 2022-04-11 20-54-44

Luxonis-Brandon commented 2 years ago

Rendering of the RGB scene was wrong above. Fixed now. Screenshot from 2022-04-11 21-38-33

Luxonis-Brandon commented 2 years ago

ecmnet commented 2 years ago

We wrote an implementation of the paper above and also a training solution for it. Seems to be initially working and starting to train/converge OK-ish.

Any code available already?

edgarriba commented 2 years ago

@Luxonis-Brandon what model are you using in the end? And what's the time performance for the results you show here? Will it be doable to run it alongside other detection and segmentation networks?

garybradski commented 2 years ago

Awesome. Of course, it would be nice to know how much of the camera compute resources are used by this in memory and time.

tersekmatija commented 2 years ago

Any code available already?

@ecmnet not yet, we are working on getting this out as soon as possible. We are doing a custom implementation of a model based on this paper: https://arxiv.org/pdf/2110.15367.pdf. I think the authors also link to their own implementation here: https://cvlab-unibo.github.io/neural-disparity-refinement-web/, but unfortunately you cannot run this directly on the camera. You can experiment on CPU / GPU. :smiley:

@Luxonis-Brandon what model are you using in the end? And what's the time performance for the results you show here? Will it be doable to run it alongside other detection and segmentation networks?

Awesome. Of course, it would be nice to know how much of the camera compute resources are used by this in memory and time.

@garybradski @edgarriba The model itself is very heavy, especially the MLP heads at the end. We are working on a lighter version of the model that will be suitable for devices and will likely need a few more iterations before we release it to the public. Our first goal is to have the model run on device, with performance that makes it practical. For the first few iterations I'd say it would not be possible to run other NNs in addition to this on Gen2, but we want to achieve this in the future. Not yet sure whether this will be possible or not. Images that @Luxonis-Brandon shared above are from our first implementation of the heavy model, but we are starting to see some results with our lighter version as well. We'll share once we have more! :rocket:

edgarriba commented 2 years ago

@tersekmatija not sure how much you are planning to tweak the model, but replacing the MLP with some kind of separable convs might help to reduce the memory consumption. My approach would be: take a simple, light U-Net-style network, input 6xHxW (left/right rectified RGB and SGBM stereo), and output disparity, which is a better representation. To compute depth you have the camera calibration.
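To give a feel for the savings being suggested, here is a small back-of-the-envelope sketch comparing the parameter count of a standard convolution with a depthwise separable one. The layer sizes are hypothetical and not taken from the model discussed in this thread; the formulas are the usual weight-plus-bias counts.

```python
def dense_conv_params(c_in, c_out, k):
    """Standard k x k convolution: one k*k*c_in filter per output channel, plus biases."""
    return k * k * c_in * c_out + c_out


def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = k * k * c_in + c_in      # one k x k filter per input channel
    pointwise = c_in * c_out + c_out     # 1x1 conv mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 64 channels with a 3x3 kernel.
dense = dense_conv_params(64, 64, 3)
separable = separable_conv_params(64, 64, 3)
print(f"dense: {dense} params, separable: {separable} params, "
      f"ratio: {dense / separable:.1f}x")
```

Parameter count is only a proxy: actual memory use and speed on the device also depend on activation sizes and on how well each op is supported.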

ZlodeiBaal commented 2 years ago

Just found this in the repo - https://github.com/luxonis/depthai-experiments/tree/master/gen2-crestereo-stereo-matching Looks nice! Pretty slow, but wow. It works even with glass! @Luxonis-Brandon is this the experiment that you mentioned above?

tersekmatija commented 2 years ago

Hey @ZlodeiBaal , that's a different model - CREStereo, which does pretty well on the stereo data (I am appending some images below).

We are investigating good practices and doing some experiments in the background. Screenshot from 2022-04-26 14-22-55

Screenshot from 2022-04-26 15-03-00

gurbain commented 2 years ago

Hi @ZlodeiBaal !

Really nice, I think that CREStereo is one of the best models that I have tested, and it is good to see it has been ported to the OAK-D. By "pretty slow", could you detail a bit more how long it takes approximately per image?

Luxonis-Brandon commented 2 years ago

Thanks @gurbain . Actually, as shown above, that's running on OAK-D. You can see 1.82 FPS in one case and 3.39 FPS in another case. So that gives an idea. In some applications this may be plenty fast actually. But for others this may be way too slow.
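In terms of time per image, those frame rates work out to roughly 0.55 s and 0.30 s per frame. A trivial conversion, with `frame_time_ms` as a hypothetical helper name:

```python
def frame_time_ms(fps):
    """Convert frames per second into milliseconds per frame."""
    return 1000.0 / fps


for fps in (1.82, 3.39):
    print(f"{fps} FPS -> {frame_time_ms(fps):.0f} ms per frame")
```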

gurbain commented 2 years ago

Thanks @Luxonis-Brandon! Did not see the FPS in the corner, my bad! :) Seems like a very good FPS given the CREStereo time performances indeed!

PINTO0309 commented 2 years ago

RealtimeStereo - Improvement status as of today https://github.com/JiaRenChang/RealtimeStereo Why not give it a try if you are interested? rtstereonet_maxdisp192_180x320.zip rtstereonet_maxdisp192_480x640.zip

PINTO0309 commented 2 years ago

rtstereonet_maxdisp192_720x1280.zip

Luxonis-Brandon commented 2 years ago

https://twitter.com/nburrus/status/1528750927037046784?s=21&t=-1nO4bfsI7ZhImVwyWTpQw
