isl-org / MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
MIT License
4.25k stars 597 forks

DEPTH VALUE OF EACH PIXEL #261

Open AdnanErdogn opened 6 months ago

AdnanErdogn commented 6 months ago

How can I access this information with MiDaS?

heyoeyo commented 6 months ago

The midas models output inverse depth maps (or images). So each pixel of the output corresponds to a value like: 1/depth

However, the mapping is also only relative, it doesn't tell you the exact (absolute) depth. Aside from noise/errors, the true depth value is shifted/scaled compared to the result you get from the midas output after inverting, so more like:

~~true depth = A + B * (1 / midas output)~~ (see post below)

Where A is some offset and B is some scaling factor, that generally aren't knowable using the midas models alone. You can try something like ZoeDepth to get actual depth values or otherwise try fitting the midas output to some other reference depth map, like in issue #171

JoshMSmith44 commented 6 months ago

According to #171 I believe the equation is: (1.0 / true_depth) = A + (B * midas_output) so then true_depth = 1.0 / (A + (B * midas_output))
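
To make the relationship concrete, here is a minimal sketch (my own illustration, not code from this repo) that applies the corrected equation to a MiDaS output, assuming A and B are already known for that image:

```python
import numpy as np

def midas_to_metric(midas_output, A, B):
    """Convert a relative (inverse-depth) MiDaS prediction to metric depth,
    assuming the shift A and scale B are already known for this image."""
    inverse_depth = A + B * midas_output              # (1 / true_depth) = A + B * midas_output
    return 1.0 / np.clip(inverse_depth, 1e-8, None)   # guard against division by zero

# example with a fake 2x2 prediction and made-up A and B values
pred = np.array([[0.2, 0.5], [0.7, 0.9]])
print(midas_to_metric(pred, A=0.05, B=0.02))
```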

heyoeyo commented 6 months ago

so then true_depth = 1.0 / (A + (B * midas_output))

Good point! I was thinking these are the same mathematically, but there is a difference, and having the shifting done before inverting makes more sense.

Eyshika commented 5 months ago

How are A and B calculated for a video ? @JoshMSmith44

JoshMSmith44 commented 5 months ago

How are A and B calculated for a video ? @JoshMSmith44

I believe MiDaS is a single-image method and therefore there is a different A and B for each frame in the video sequence.

Eyshika commented 5 months ago

How are A and B calculated for a video ? @JoshMSmith44

I believe MiDaS is a single-image method and therefore there is a different A and B for each frame in the video sequence.

But in MiDaS, A and B are calculated by comparing the true depth with the estimated depth. What if we have completely new images and want to find metric depth?

JoshMSmith44 commented 5 months ago

In order to get the true depth using the above method, you need to know at least two true depth pixel values for each relative depth image you correct (realistically you want many more). These could come from a sensor, a sparse structure-from-motion point cloud, etc. If you don't have access to true depth and you need metric depth, then you should look into metric depth estimation methods like ZoeDepth, Depth-Anything, and ZeroDepth.
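
As a rough sketch of that fitting step (my own illustration built on the equation above, not code from this repo), given a handful of pixels with known metric depth you could solve for A and B with a least-squares fit in inverse-depth space:

```python
import numpy as np

def fit_shift_scale(midas_values, true_depths):
    """Fit A and B in (1 / true_depth) = A + B * midas_output via least squares.
    midas_values: MiDaS outputs at pixels where the metric depth is known.
    true_depths:  the corresponding known metric depths (e.g. from a sensor or SfM)."""
    midas_values = np.asarray(midas_values, dtype=np.float64)
    target = 1.0 / np.asarray(true_depths, dtype=np.float64)   # inverse depth
    design = np.stack([np.ones_like(midas_values), midas_values], axis=1)
    (A, B), *_ = np.linalg.lstsq(design, target, rcond=None)
    return A, B

# example with made-up numbers: 3 pixels with known depths of 2m, 5m and 10m
A, B = fit_shift_scale([0.9, 0.4, 0.15], true_depths=[2.0, 5.0, 10.0])
full_prediction = np.array([[0.9, 0.4], [0.15, 0.6]])           # fake full MiDaS map
metric_depth = 1.0 / (A + B * full_prediction)
print(A, B, metric_depth)
```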

puyiwen commented 2 months ago

The midas models output inverse depth maps (or images). So each pixel of the output corresponds to a value like: 1/depth

However, the mapping is also only relative, it doesn't tell you the exact (absolute) depth. Aside from noise/errors, the true depth value is shifted/scaled compared to the result you get from the midas output after inverting, so more like:

~~true depth = A + B * (1 / midas output)~~ (see post below)

Where A is some offset and B is some scaling factor, that generally aren't knowable using the midas models alone. You can try something like ZoeDepth to get actual depth values or otherwise try fitting the midas output to some other reference depth map, like in issue #171

Hi, if I just use the midas output, which you said is inverse depth, to train my model, I want to get the relative depth for an image. Am I doing something wrong?

puyiwen commented 2 months ago

The midas models output inverse depth maps (or images). So each pixel of the output corresponds to a value like: 1/depth

However, the mapping is also only relative, it doesn't tell you the exact (absolute) depth. Aside from noise/errors, the true depth value is shifted/scaled compared to the result you get from the midas output after inverting, so more like:

~~true depth = A + B * (1 / midas output)~~ (see post below)

Where A is some offset and B is some scaling factor, that generally aren't knowable using the midas models alone. You can try something like ZoeDepth to get actual depth values or otherwise try fitting the midas output to some other reference depth map, like in issue #171

Hi @heyoeyo, I want to know how a metric depth dataset (like DIML) and a relative depth dataset (like RedWeb) are trained together. Do you convert the metric depth dataset to a relative depth dataset first? Can you help me? Thank you very much!!

heyoeyo commented 1 month ago

One of the MiDaS papers describes how the data is processed for training. The explanation starts on page 5, under the section: Training on Diverse Data

There they describe several approaches they considered, which are later compared on plots (see page 7) showing that the combination of the 'ssitrim + reg' loss functions worked the best. These loss functions are both described on page 6 (equations 7 & 11).

The explanation just above the 'ssitrim' loss is where they describe how different data sets are handled. The basic idea is that they first run their model on an input image to get a raw prediction, which is then normalized (using equation 6 in the paper). They repeat the same normalization procedure for the ground truth, and then calculate the error as: abs(normalized_prediction - normalized_ground_truth_disparity), which is computed for each 'pixel' in the prediction and summed together. For the 'ssitrim' loss specifically, they ignore the top 20% largest errors when calculating the sum.

So due to the normalization step, both relative & metric depth data sources should be able to be processed/trained using the same procedure.
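
A minimal sketch of that procedure (my own paraphrase of the paper's description, not the actual training code from this repo; the exact normalization and reduction details are assumptions):

```python
import torch

def normalize_disparity(d, mask):
    """Scale-and-shift normalization (roughly eq. 6 in the paper): subtract the
    median and divide by the mean absolute deviation over valid pixels."""
    valid = d[mask]
    t = valid.median()
    s = (valid - t).abs().mean().clamp(min=1e-8)
    return (d - t) / s

def ssitrim_loss(prediction, gt_disparity, mask, trim=0.2):
    """Trimmed scale-and-shift-invariant loss: per-pixel absolute error between the
    normalized prediction and normalized ground-truth disparity, ignoring the
    largest `trim` fraction of errors before summing."""
    pred_n = normalize_disparity(prediction, mask)
    gt_n = normalize_disparity(gt_disparity, mask)
    errors, _ = torch.sort((pred_n - gt_n).abs()[mask])   # ascending per-pixel errors
    keep = int((1.0 - trim) * errors.numel())              # drop the top 20% largest errors
    return errors[:keep].sum() / max(errors.numel(), 1)

# example with random tensors standing in for a prediction / ground-truth disparity
pred = torch.rand(1, 64, 64)
gt = torch.rand(1, 64, 64)
print(ssitrim_loss(pred, gt, mask=gt > 0))
```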

puyiwen commented 1 month ago

@heyoeyo, thank you for your reply. I have another question about relative depth evaluation. Why should the output of the model (relative depth) be converted to metric depth and evaluated on metric depth datasets like NYU and KITTI, using metrics such as RMSE and abs_rel? Why not just use a relative depth dataset for evaluation?

heyoeyo commented 1 month ago

I think it depends on what the evaluation is trying to show. Converting to metric depth would have the effect of more heavily weighting errors on scenes that have wider depth ranges. For example a 10% error on an indoor scene with elements that are only 10m away would be a 1m error, whereas a 10% error on an outdoor scene with objects 100m away would have a 10m error, and that might be something the authors want to prioritize (i.e. model accuracy across very large depth ranges).

It does seem strange to me that the MiDaS paper converted some results to metric depth for their experiments section though. Since it seems they just used a least squares fit to align the relative depth results with the metric ground truth (described on pg 7), it really feels like this just over-weights the performance of the model on outdoor scenes.

It makes a lot more sense to do the evaluation directly in absolute depth for something like ZoeDepth, where the model is directly predicting the metric values and therefore those 1m vs 10m errors are actually relevant to the model's capability. (but I might be missing something, I haven't really worked with metric depth data myself)
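
For reference, a rough sketch of that kind of least-squares alignment followed by metric-depth metrics (my own illustration of the idea described on pg 7, not the official evaluation script; function and variable names are made up):

```python
import numpy as np

def align_and_evaluate(pred_disparity, gt_depth, min_depth=1e-3):
    """Fit a scale/shift that maps the relative prediction onto ground-truth
    disparity, then compute RMSE and abs_rel in metric depth."""
    valid = gt_depth > min_depth
    gt_disparity = 1.0 / gt_depth[valid]
    x = pred_disparity[valid]
    design = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, gt_disparity, rcond=None)
    aligned_depth = 1.0 / np.clip(scale * x + shift, 1e-8, None)
    gt = gt_depth[valid]
    rmse = np.sqrt(np.mean((aligned_depth - gt) ** 2))
    abs_rel = np.mean(np.abs(aligned_depth - gt) / gt)
    return rmse, abs_rel

# example with made-up arrays standing in for a prediction and ground truth
pred = np.random.rand(64, 64)
gt = np.random.uniform(1.0, 10.0, size=(64, 64))
print(align_and_evaluate(pred, gt))
```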