isl-org / MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
MIT License

Temporal consistency #174

Open · ReiniertB opened this issue 2 years ago

ReiniertB commented 2 years ago

Hi, I really love your models and they are extremely helpful. However, when I apply the model to a video, the scaling is very inconsistent. Does anyone have tips on how to improve the temporal consistency when applying the model in a live setting? It does not have to be an ML-based solution. Also, is there a way to constrain the depth maps? For example, so that the depth map only shows values up to 10 cm and everything beyond that is just black? Thanks in advance.

vitacon commented 1 year ago

I'd like to improve the temporal consistency too. https://youtu.be/z6fK-kdMZNQ

I guess the depth maps could be "normalized" by some post-processing, and the same applies to your request of constraining the max depth, but it could be rather tricky with a moving camera and it could be too slow for a "live setting"...

vitacon commented 1 year ago

Um, I see the values from the .png files can't really be consistent, because they are "normalized" to fill the whole range of the output (e.g. 0-255):

out = max_val * (depth - depth_min) / (depth_max - depth_min)

I suppose the right way to keep the results consistent across frames is to use the values from the PFM files: analyze all of them to find the global min and max, and then use those values to convert all PFMs to PNGs.

The hacky way could be to look into a few PFMs to see the typical range of the results, make depth_min and depth_max constants, and hope for the best... =}
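
Roughly like this, as a two-pass post-processing step (just a sketch; read_pfm and write_png16 are placeholders for whatever PFM reader and 16-bit PNG writer you use -- MiDaS ships a PFM reader in utils.py, I believe):

import glob
import numpy as np
from my_depth_io import read_pfm, write_png16  # placeholder module -- use your own I/O helpers

pfm_files = sorted(glob.glob("output/*.pfm"))

# pass 1: find the global range over the whole sequence
global_min = min(read_pfm(f).min() for f in pfm_files)
global_max = max(read_pfm(f).max() for f in pfm_files)

# pass 2: normalize every frame with the same constants
max_val = (2 ** 16) - 1  # 16-bit PNG
for f in pfm_files:
    depth = np.clip(read_pfm(f), global_min, global_max)
    out = max_val * (depth - global_min) / (global_max - global_min)
    write_png16(f.replace(".pfm", ".png"), out.astype(np.uint16))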

hildebrandt-carl commented 1 year ago

Any pointers here would be appreciated :) I am unable to analyze all files, as I am using this on a robot that operates in the real world. The hacky way also doesn't work because the scaling is inconsistent over time.

vitacon commented 1 year ago

Well, I removed the relative normalization and used this:

import numpy as np

# fixed depth range instead of the per-frame min/max (constants picked by eyeballing a few frames)
depth_min = 0
depth_max = 12000
bits = 2  # assuming 16-bit PNG output
max_val = (2 ** (8 * bits)) - 1
np.clip(depth, depth_min, depth_max, out=depth)  # clip in place to the fixed range
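
The conversion itself then uses the same formula as before, only with those fixed constants (just a sketch; the output path is an example, and cv2 writes 16-bit PNGs from uint16 arrays):

import cv2

# same formula as the per-frame normalization above, but with fixed constants
out = max_val * (depth - depth_min) / (depth_max - depth_min)
cv2.imwrite("frame_0001_depth.png", out.astype("uint16"))  # example output path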

Of course, it did not help much. Sometimes MiDaS really surprises me by how different its results are for very similar input frames.

https://youtu.be/81ScNArJ-fE

Actually, the output frames are so varied that I could not get reasonable results even with area-based normalization.

dronus commented 1 year ago

My guess is that the depth scale is somewhat arbitrary by the nature of the problem. To keep it consistent, data from multiple frames should be used, e.g. evaluating the network on the current image together with the last image's depth map, or with some latent state of the last image(s).

Other projects either graft some magic around it, using optical flow for camera pose estimation, or retrain the network with temporal-consistency rewards. But that means either retraining the network during evaluation of an individual movie, or ending up with a network that still works on single frames, just with better average consistency, which may easily fail again, of course.
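
As a crude non-learned baseline along those lines, one could at least align each new (relative) depth map to the previous one with a least-squares scale and shift and blend them over time. A minimal sketch (depth_frames here is just a placeholder for your per-frame MiDaS outputs), which of course starts to lag and ghost as soon as the scene moves:

import numpy as np

def align_scale_shift(pred, ref):
    # least-squares s, t such that s * pred + t ~= ref
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return s * pred + t

alpha = 0.8  # how much of the previous frame to keep
prev = None
for depth in depth_frames:  # placeholder: your per-frame MiDaS outputs
    if prev is not None:
        depth = align_scale_shift(depth, prev)      # undo per-frame scale/shift jumps
        depth = alpha * prev + (1 - alpha) * depth  # simple exponential smoothing
    prev = depth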

dronus commented 1 year ago

Also, a secondary network could be trained to filter the final depth images, e.g. last three depth images in, one normalized image out, like temporal super-resolution.
That could easily be trained on movies with depth ground-truth data.
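
A minimal sketch of what such a filter network could look like (PyTorch, purely illustrative -- layer sizes, and the training setup in the comment, are made up):

import torch.nn as nn

class DepthStabilizer(nn.Module):
    # last three depth maps in (stacked as channels), one filtered depth map out
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, depths):   # depths: (B, 3, H, W) = last three depth maps
        return self.net(depths)  # (B, 1, H, W) stabilized depth

# Training would compare the output to ground-truth depth from videos, e.g. an L1
# loss plus a temporal term between consecutive outputs.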

ReiniertB commented 1 year ago

Thanks for all the replies. If anyone has a suggestion for depth estimation/prediction networks that can be easily trained without supervision, that would be greatly appreciated.

KexianHust commented 10 months ago

@ReiniertB We have developed a video depth estimation model ViTA based on MiDaS 3.0. Hope this can help you!

vitacon commented 10 months ago

> @ReiniertB We have developed a video depth estimation model ViTA based on MiDaS 3.0.

I wonder why you stuck with MiDaS 3.0? What is wrong with 3.1 for you?

KexianHust commented 10 months ago

> > @ReiniertB We have developed a video depth estimation model ViTA based on MiDaS 3.0.

> I wonder why you stuck with MiDaS 3.0? What is wrong with 3.1 for you?

Because our paper was submitted last year, at that time we could only use MiDaS 3.0. Of course, we would like to train a 3.1 version.

vitacon commented 10 months ago

> Because our paper was submitted last year, at that time we could only use MiDaS 3.0.

I see. =) I got confused by this: "[08/2023] Initial release of inference code and models."

> Of course, we would like to train a 3.1 version.

"Would like" does mean you are planning doing it soon or is it more of a theoretical option? =}

KexianHust commented 10 months ago

> > Because our paper was submitted last year, at that time we could only use MiDaS 3.0.

> I see. =) I got confused by this: "[08/2023] Initial release of inference code and models."

> > Of course, we would like to train a 3.1 version.

> Does "would like" mean you are planning to do it soon, or is it more of a theoretical option? =}

I will release the 3.1 version once the models are trained.

RaymondWang987 commented 9 months ago

@ReiniertB @vitacon Our work Neural Video Depth Stabilizer (NVDS) was accepted at ICCV 2023. NVDS can stabilize any single-image depth predictor in a plug-and-play manner, without additional training or any extra effort. We have tried NVDS with MiDaS, DPT, MiDaS 3.1, and NewCRFs, and the results are quite satisfactory. You can simply change the depth predictor to MiDaS 3.1 (only adjusting one line in our demo code) and NVDS will produce a significant improvement in temporal consistency.

CJCHEN1230 commented 3 months ago

@KexianHust Hi, I'm really interested in your work. It seems like you haven't made your paper public yet. Could you share a link to it?