talasalim opened this issue 1 year ago
Could you describe more what problems you are facing? The output of the model is metric depth. If you think the units are wildly inaccurate, try with config_mode=eval while loading the model. You can choose ZoeD_N for indoor scenes, ZoeD_K for outdoor road scenes, and ZoeD_NK for generic scenes.
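For reference, a minimal sketch of that suggestion via torch.hub; whether the hub entrypoint forwards the config_mode keyword is an assumption based on the comment above, not something verified here:

import torch

# Hedged sketch: pass config_mode="eval" when loading, as suggested above
# (assuming the hub entrypoint forwards this keyword to the config builder).
# Pick ZoeD_N (indoor), ZoeD_K (outdoor road) or ZoeD_NK (generic scenes).
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True, config_mode="eval")
zoe = zoe.to("cuda" if torch.cuda.is_available() else "cpu")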
@shariqfarooq123 I think @talasalim means how to get a metric distance (e.g. in meters) between two known x,y coordinates of the original picture (or the distance from the camera to an object surface), when providing two x,y coordinates that are known to span a fixed length in the picture for calibration.
@shariqfarooq123 @Teifoc Yes that is what I meant. Is there a way to get the absolute metric depth at a certain x,y coordinate?
Following up here. I think you might need to provide the camera intrinsics that are unique per camera but I'm assuming these are known for the dataset in question. @talasalim @shariqfarooq123 @Teifoc any ideas?
Under the file geometry.py I found two functions, get_intrinsics and depth_to_points. I think if we change depth_to_points to the following, we can define the camera intrinsics and extrinsics as we want:
def depth_to_points(depth, K=None, R=None, t=None):
    # Let the caller pass their own intrinsics K and extrinsics R, t;
    # otherwise fall back to the defaults the original function uses.
    if K is None:
        K = get_intrinsics(depth.shape[1], depth.shape[2])
    Kinv = np.linalg.inv(K)
    if R is None:
        R = np.eye(3)
    if t is None:
        t = np.zeros(3)
    # ... rest of the original depth_to_points unchanged
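As a reference for that change, here is a minimal, self-contained back-projection sketch; the function name depth_to_points_simple and the calibration values are placeholders for illustration, not part of the ZoeDepth API:

import numpy as np

def depth_to_points_simple(depth, fx, fy, cx, cy):
    # depth: (H, W) array of metric depth in meters; fx, fy, cx, cy come from
    # your own camera calibration. Returns (H, W, 3) points in the camera frame.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid (columns, rows)
    x = (u - cx) / fx * depth                       # X = (u - cx) * Z / fx
    y = (v - cy) / fy * depth                       # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1)

# Example with made-up calibration values:
# points = depth_to_points_simple(predicted_depth, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)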
Following up on this, does somebody know which unit is used for the metric depth? Comparing my results to ground truth data ranging from 5 to 45 meters, I have values from 1200 to 8400 in my ZoeDepth output. Is this supposed to be millimeters? Steps of 5 mm?
Hello, sorry, I'm quite a newbie here. Are the numbers you were mentioning the result of zoe.infer_pil(image)? And can we use that directly as the estimate of the metric depth value, or are there other steps to get that?
Although the model is trained to predict metric depth, due to the limited data size I think the prediction is still not metrically accurate, but it should be scale aware (i.e. if an object is twice as far as another, even if the absolute depth is incorrect, the proportion should be the same). In short, I think the numbers are still "up to some scale".
Honestly, I have pretty good results taking the output of zoe.infer_pil(image) directly as millimeters, but some of these algorithms do provide an output equivalent to MetricDepth = Scale * OutputDepth + Shift, where scale and shift depend on your camera parameters. If you're not sure about that, you can use linear regression to estimate those parameters, given that you have ground truth, as sketched below.
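For example, that linear regression could be a simple least-squares fit (a sketch; it assumes the prediction and ground truth are aligned arrays of the same shape, and that zeros mark missing ground-truth pixels):

import numpy as np

def fit_scale_shift(pred, gt):
    # Solve gt ≈ scale * pred + shift in the least-squares sense,
    # ignoring pixels where ground truth is missing (assumed to be 0).
    valid = gt > 0
    p, g = pred[valid].ravel(), gt[valid].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, g, rcond=None)
    return scale, shift

# metric_depth = scale * predicted_depth + shift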
The model is trained to predict meters though
Could you describe more what problems you are facing? The output of the model is metric depth. If you think the units are wildly inaccurate, try with config_mode=eval while loading the model. You can choose ZoeD_N for indoor scenes, ZoeD_K for outdoor road scenes, and ZoeD_NK for generic scenes.
Well, it says that the output is metric, not meters, right? At least in my case, if the output is actually meters, it would be insanely inaccurate.
the depth in training and eval is converted to meters: https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/data/data_mono.py#L353-L354 https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/data/ddad.py#L98 https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/data/diml_indoor_test.py#L97-L98
As @kwea123 pointed out, the model was trained with meters as the unit for depth, so the output is always supposed to be in meters. However, the input padding in the infer and infer_pil APIs may easily change the overall scale of the output, though it should be more or less consistent. Try turning the padding off with pad_input=False (at the cost of border artifacts, see zoedepth.models.depth_model:L57).
TLDR:
import torch
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
predicted_depth = zoe.infer_pil(image, pad_input=False) # Better 'metric' accuracy
Let me know if this helps
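Building on that TLDR, reading the metric value at a single pixel is then just array indexing. A minimal sketch, assuming infer_pil returns an (H, W) NumPy array in meters (its default numpy output); the image path is a placeholder:

import torch
from PIL import Image

image = Image.open("your_image.jpg").convert("RGB")  # placeholder path
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
predicted_depth = zoe.infer_pil(image, pad_input=False)  # (H, W) array, meters

x, y = 320, 240                                          # example pixel: column x, row y
print("depth at (x, y) in meters:", predicted_depth[y, x])  # note the row-first indexing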
Okay, thanks a lot! I was actually using the save_raw_16bit function from misc.py, which multiplies all values by 256:
import numpy as np
import torch
from PIL import Image

def save_raw_16bit(depth, fpath="raw.png"):
    if isinstance(depth, torch.Tensor):
        depth = depth.squeeze().cpu().numpy()

    assert isinstance(depth, np.ndarray), "Depth must be a torch tensor or numpy array"
    assert depth.ndim == 2, "Depth must be 2D"
    depth = depth * 256  # scale for 16-bit png
    depth = depth.astype(np.uint16)
    depth = Image.fromarray(depth)
    depth.save(fpath)
    print("Saved raw depth to", fpath)
No wonder I had bad metrics when comparing to ground truth... Thanks for pointing that out!
Interesting! So now are you able to reproduce the ground truth metric depth?
Well, it sure is better than before, but it still struggles with the background of my ground truth. Here is what it looks like: The background is ~30 meters farther than predicted. Also, I should mention that I used the zoedepth_nk model.
Following up on this, does somebody know which unit is used for the metric depth? Comparing my results to ground truth data ranging from 5 to 45 meters, I have values from 1200 to 8400 in my ZoeDepth output. Is this supposed to be millimeters? Steps of 5 mm?
If you look at the code of the utility function save_raw_16bit (or something like that), you'll see they take the data, multiply it by 256, and round it to an unsigned 16-bit integer (so 0 to 65535). That means you can:
a) use the raw data yourself, since it is floating-point numbers that represent meters as far as I know (the model can be off, of course), or
b) read in the raw 16-bit integers you might already have and divide the values by 256 to get close to the original float output of the model.
The values you mention, divided by 256, come closer to what you describe as the values you are looking for.
(Edit: upon reloading I now see there were already replies and this has been said before. Sorry. When I opened the issue that part of the discussion wasn't visible to me.)
Well, it sure is better than before, but it still struggles with the background of my ground truth. Here is what it looks like: The background is ~30 meters farther than predicted. Also, I should mention that I used the zoedepth_nk model.
When I use the function save_raw_16bit, I only get a totally black picture. How do you get the real distance? Which function do you use? Thank you for your answer!
If using save_raw_16bit: you get back a greyscale image, in other words width x height with, for every point, a number between 0 and 65535 (the 16-bit integer range). It looks almost black in a normal image viewer because depth-in-meters times 256 is tiny compared to the 65535 maximum.
Divide that number by 256 to get what the model predicts as meters. Of course it depends on camera, model accuracy, upscaling and all that, but the numbers save_raw_16bit writes out are meters multiplied by 256, so divide by 256 to get back some sort of meters.
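A minimal sketch of that conversion, assuming the file was written by the save_raw_16bit shown above (default path raw.png):

import numpy as np
from PIL import Image

raw = np.asarray(Image.open("raw.png"))    # uint16 values in [0, 65535]
depth_m = raw.astype(np.float32) / 256.0   # back to approximate meters
print(depth_m.min(), depth_m.max())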
Thank you for your help! My code was like this:
import torch
from PIL import Image

image = Image.open("image.png").convert("RGB")
model_zoe_n = torch.hub.load(".", "ZoeD_NK", pretrained=True, source="local")
DEVICE = "cuda:1" if torch.cuda.is_available() else "cpu"
zoe = model_zoe_n.to(DEVICE)
depth = zoe.infer_pil(image)
I find that the numbers save_raw_16bit returns are the depth multiplied by 256, so I think the depth there should be the real distance in the photo? If I am right, the result is bad. Maybe the reason is that the camera is too close to the object in my photo; it is only about 20 cm away from the camera.
Well, it sure is better than before, but it still struggles with the background of my ground truth. Here is what it looks like: The background is ~30 meters farther than predicted. Also, I should mention that I used the zoedepth_nk model.
May I ask how you generated your result graph?
Well, it sure is better than before, but it still struggles with the background of my ground truth. Here is what it looks like: The background is ~30 meters farther than predicted. Also, I should mention that I used the zoedepth_nk model.
Hello, can you please tell me how you generated the ground truth for an image? I want to compare my predicted depth with ground truth too. Thanks!
You can't generate the ground truth, you have to actually measure it. You have two options (that I know of):
Well, it sure is better than before, but it still struggles with the background of my ground truth. Here is what it looks like: The background is ~30 meters farther than predicted. Also, I should mention that I used the zoedepth_nk model.
May I ask how you generated your result graph?
Sorry, I just saw your question. Which result graph are you talking about? For all three of them, I plotted the output matrix. I don't think I still have the code I used.
How can I use my data to get the metric depth at a pixel level using the ZoeD model?