Hi @alex-bene,
Thanks for opening this issue and writing up such a detailed report - it's greatly appreciated.
Yes, indeed, there should be proper post-processing for the model's outputs (in fact, this was what triggered the issue).
With regards to the specific points in the TLDR:
Currently, the output from Transformers for this model also includes padding. This is surprising, since it is not mentioned anywhere in the documentation and none of the other depth models do this.
Here, I share a post_processing_depth_estimation function that removes the padding, resizes the model output to match the input image, and additionally returns a depth image.
In an ideal world, to be coherent with the rest of the models and avoid confusion, I believe that the output of the ZoeDepth model should already be cropped to remove the padding; however, this is not very easy considering its inputs (an input for source_size should be added in this case).
Still, after using the function above, there is a discrepancy between the output I get using HF vs. the original implementation.
There could be a multitude of things happening here, as the call to the torch hub model includes all pre- and post-processing as well as the model's forward pass. cc @NielsRogge regarding differences observed when porting.
Hi,
Thanks for the detailed report. I ported the model by performing inference with the original repository and then making sure that both the preprocessing and a forward pass match the HF implementation. The script I used on the original implementation can be found here: https://github.com/isl-org/ZoeDepth/compare/main...NielsRogge:ZoeDepth:understanding_zoedepth?expand=1 (I ran the inference.py script). The logits and preprocessing are verified in the conversion script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/zoedepth/convert_zoedepth_to_hf.py
The notebook link by @NielsRogge probably has some code to show how the padding is added and removed; however, since I do not have access to it, I'll share here the various info I have gathered as well as a post-processing function roughly following the style of post_process_object_detection.
The notebook is available here: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ZoeDepth/Inference_with_ZoeDepth.ipynb.
Hey @amyeroberts and thanks for the quick response,
Very nice! Would you like to open a PR to add this? This way you get the GitHub contribution for the work you've done.
I'll probably get to it starting tomorrow, thanks.
It's not uncommon to pad images before passing them to the model so that they can be batched. I'm surprised this is being applied even for a single image. cc @NielsRogge to confirm the intended behaviour here.
Indeed; however, the padding here is not for the inputs to be batched, but rather to fix the boundary artifacts in the output depth map. Additionally, it is a dynamic padding (based on the size of the input image), which makes removing it more complicated (i.e., I had to search the source code of the ImageProcessor and the original implementation to check how much padding was added and at what stage). Still, since none of the other depth models output a padded image, nor does the original implementation (it has the post-processing step integrated), I believe a small note in the docs would help a lot of future devs trying to use the model.
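For reference, the amount of padding scales with the input resolution roughly like this (a quick sketch of the formula used by the preprocessing; zoedepth_padding is just an illustrative helper name):

import numpy as np

def zoedepth_padding(height: int, width: int, fh: int = 3, fw: int = 3):
    # Padding added on each side of the image before resizing; fh/fw default to 3
    return int(np.sqrt(height / 2) * fh), int(np.sqrt(width / 2) * fw)

# e.g. a 480x640 image gets 46 px of vertical and 53 px of horizontal padding on each side
print(zoedepth_padding(480, 640))  # -> (46, 53)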
@NielsRogge I looked over what you sent and indeed it seems to suggest a match between the HF implementation and the original repo. Maybe the discrepancy in my code has something to do with the padding removal (?). I will do a more in-depth check and get back to you.
Hey, so I revisited this today.
It turns out that in the original repo, they run inference on the image as well as a horizontally flipped copy of the image and then average the results, as can be seen here.
So, to have a 1-1 mapping with the original implementation we need to do the following:
from PIL import Image
import requests
import torch
import numpy as np
from transformers import AutoImageProcessor, ZoeDepthForDepthEstimation

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu")
model = ZoeDepthForDepthEstimation.from_pretrained("Intel/zoedepth-nyu")

# prepare image for the model
url = "https://www.greece-is.com/wp-content/uploads/2016/07/ATH_RIVIERA_naos-poseidona-sounio-01.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=[image, image], return_tensors="pt")

# run both the regular and the horizontally flipped inputs through the model
with torch.no_grad():
    outputs = model(**inputs)
    outputs_flip = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))

processed_output = post_process_depth_estimation_zoedepth(
    outputs, [image.size[::-1]] * 2, outputs_flip=outputs_flip
)[0]

print("Input image size (h, w):", image.size[::-1])
print("Output predicted depth shape (h, w):", processed_output["predicted_depth"].shape)
print("Output depth image size (h, w):", processed_output["depth"].size[::-1])
processed_output["depth"].show()
Where the post_process_depth_estimation_zoedepth function is updated to accept the flipped outputs:
from typing import Dict, List, Optional, Tuple, Union

import numpy as np
import torch
from PIL import Image
from torch.nn import functional as F

from transformers.models.zoedepth.modeling_zoedepth import ZoeDepthDepthEstimatorOutput


def post_process_depth_estimation_zoedepth(
    outputs: ZoeDepthDepthEstimatorOutput,
    source_sizes: Union[torch.Tensor, List[Tuple[int, int]]],
    target_sizes: Optional[Union[torch.Tensor, List[Tuple[int, int]]]] = None,
    outputs_flip: Optional[ZoeDepthDepthEstimatorOutput] = None,
    remove_padding: bool = True,
) -> List[Dict]:
    """
    Converts the raw output of [`ZoeDepthDepthEstimatorOutput`] into final depth predictions and a depth PIL image.
    Only supports PyTorch.

    Args:
        outputs ([`ZoeDepthDepthEstimatorOutput`]):
            Raw outputs of the model.
        source_sizes (`torch.Tensor` or `List[Tuple[int, int]]`):
            Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the source size
            (height, width) of each image in the batch before preprocessing.
        target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*):
            Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
            (height, width) of each image in the batch. If left to None, predictions will not be resized.
        outputs_flip ([`ZoeDepthDepthEstimatorOutput`], *optional*):
            Raw outputs of the model for the horizontally flipped input (averaged with `outputs` at the end).
        remove_padding (`bool`):
            By default, ZoeDepth adds padding to fix the boundary artifacts in the output depth map, so we need to
            remove this padding during post-processing. The parameter exists here in case the user changed the
            image preprocessing to not include padding.

    Returns:
        `List[Dict]`: A list of dictionaries, each dictionary containing the depth predictions and a depth PIL
        image as predicted by the model.
    """
    predicted_depth = outputs.predicted_depth

    if (outputs_flip is not None) and (predicted_depth.shape != outputs_flip.predicted_depth.shape):
        raise ValueError("Make sure that `outputs` and `outputs_flip` have the same shape")
    if (target_sizes is not None) and (len(predicted_depth) != len(target_sizes)):
        raise ValueError("Make sure that you pass in as many target sizes as the batch dimension of the predicted depth")
    if (source_sizes is None) or (len(predicted_depth) != len(source_sizes)):
        raise ValueError("Make sure that you pass in as many source image sizes as the batch dimension of the logits")

    # Stack the regular and flipped predictions along a new dimension so they can be processed together
    if outputs_flip is not None:
        predicted_depth = torch.stack([predicted_depth, outputs_flip.predicted_depth], dim=1)
    else:
        predicted_depth = predicted_depth.unsqueeze(1)

    # The ZoeDepth model adds padding around the images to fix the boundary artifacts in the output depth map.
    # The padding length is `int(np.sqrt(img_h/2) * fh)` for the height (and similarly for the width), where
    # fh (and fw respectively) are equal to 3 by default.
    # Check https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/models/depth_model.py#L57
    # for the original implementation.
    # In this section, we remove this padding to get the final depth image and depth prediction.
    if isinstance(source_sizes, torch.Tensor):
        source_sizes = [(int(h), int(w)) for h, w in source_sizes]
    if isinstance(target_sizes, torch.Tensor):
        target_sizes = [(int(h), int(w)) for h, w in target_sizes]
    fh = fw = 3

    results = []
    for i, (d, s) in enumerate(zip(predicted_depth, source_sizes)):
        d = d.unsqueeze(1)  # (num_views, 1, height, width) so that F.interpolate gets a 4D input
        if remove_padding:
            # Upsample to the padded input resolution, then crop away the padding
            pad_h = int(np.sqrt(s[0] / 2) * fh)
            pad_w = int(np.sqrt(s[1] / 2) * fw)
            d = F.interpolate(
                d, size=[s[0] + 2 * pad_h, s[1] + 2 * pad_w], mode="bicubic", align_corners=False
            )
            if pad_h > 0:
                d = d[:, :, pad_h:-pad_h, :]
            if pad_w > 0:
                d = d[:, :, :, pad_w:-pad_w]

        if target_sizes is not None:
            d = F.interpolate(d, size=target_sizes[i], mode="bicubic", align_corners=False)

        # Average the regular prediction with the un-flipped prediction of the flipped input
        if outputs_flip is not None:
            d, d_f = d.chunk(2)
            d = (d + torch.flip(d_f, dims=[-1])) / 2

        d = d.squeeze().cpu().numpy()
        pil = Image.fromarray((d * 255 / np.max(d)).astype("uint8"))
        results.append({"predicted_depth": d, "depth": pil})

    return results
With these changes, when we compare with the original implementation we get:
Max error: 6.7949295e-05
Min error: -4.7922134e-05
MSE: 2.5678112e-11
RMSE: 5.0673575e-06
MAE: 3.975931e-06
So this is fixed.
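For reference, error metrics like the above can be computed along these lines (a sketch; compare_depths, depth_hf and depth_original are just illustrative names for the two depth maps as NumPy arrays of the same shape):

import numpy as np

def compare_depths(depth_hf: np.ndarray, depth_original: np.ndarray) -> None:
    # Element-wise comparison of two depth maps of identical shape
    diff = depth_hf - depth_original
    print("Max error:", diff.max())
    print("Min error:", diff.min())
    print("MSE:      ", np.mean(diff ** 2))
    print("RMSE:     ", np.sqrt(np.mean(diff ** 2)))
    print("MAE:      ", np.mean(np.abs(diff)))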
However, for the PR, to better incorporate this flipped input/output with the model, I was wondering whether it would be better to add a do_flip input to the image processor (similar to the existing do_pad inside the image_processor) or leave it to each user to run the flipped input through the model and then pass it to the post-processor. I think I prefer incorporating the functionality into the image_processor, but I wanted to get your opinion @NielsRogge.
If the image is originally flipped, IMO it makes sense to use do_flip
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers==4.43.0
Who can help?
@amyeroberts @stevhliu
Reproduction
Hello everyone,
I stumbled across #30917 while trying to figure out what was going on with the output of the ZoeDepth model. However, since that issue is quite a bit more general (about depth estimators), I'm opening this one specifically about the outputs of ZoeDepth.
Explaining/Showing the problem
For this model, the ImageProcessor adds reflection padding around the input images to fix the boundary artifacts in the output depth map. As a result, the depth predictions and the depth image output by the model include padding, in contrast to every other depth model here. This can be easily seen by running the code below:
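(A minimal sketch of such a repro; the checkpoint name and image URL are taken from the snippets elsewhere in this thread, so the original snippet may differ.)

from PIL import Image
import requests
import torch
from transformers import AutoImageProcessor, ZoeDepthForDepthEstimation

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu")
model = ZoeDepthForDepthEstimation.from_pretrained("Intel/zoedepth-nyu")

url = "https://www.greece-is.com/wp-content/uploads/2016/07/ATH_RIVIERA_naos-poseidona-sounio-01.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth

# The raw prediction corresponds to the padded (and resized) input, not to the original image
print("Input image size (h, w):     ", image.size[::-1])
print("Predicted depth shape (h, w):", tuple(predicted_depth.shape[-2:]))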
that produces the following image:
For reference, here's the same image and the same code, but with do_pad=False inside the image_processor call.

Solution
The notebook link by @NielsRogge probably has some code to show how the padding is added and removed; however, since I do not have access to it, I'll share here the various info I have gathered as well as a post-processing function roughly following the style of post_process_object_detection.

Finding out exactly what's going on under the hood
As you can see in the original repo as well as in the ZoeDepthImageProcessor, before inference, the images are padded in both dimensions by pad_h = int(np.sqrt(height / 2) * fh) and pad_w = int(np.sqrt(width / 2) * fw), where fh and fw are equal to 3 by default. Then, the images are resized and fed into the model. Thus, to get the final depth predictions and image corresponding to the input image, you need to interpolate the model output back to the padded input size, crop away the padding, and (optionally) resize to the desired target size.

ZoeDepth post_process_depth_estimation
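In condensed form, the idea is: interpolate the raw prediction back to the padded input size, crop away the padding, and optionally resize to a target size. A rough sketch (unpad_depth is an illustrative helper, not the exact function from this issue; the full version appears earlier in this thread):

import numpy as np
import torch
from torch.nn import functional as F

def unpad_depth(predicted_depth: torch.Tensor, source_size, fh: int = 3, fw: int = 3) -> torch.Tensor:
    # predicted_depth: (height, width) raw model output for a single image
    # source_size: (height, width) of the original image before preprocessing
    h, w = source_size
    pad_h, pad_w = int(np.sqrt(h / 2) * fh), int(np.sqrt(w / 2) * fw)
    # 1. interpolate back to the padded input resolution
    depth = F.interpolate(
        predicted_depth[None, None], size=(h + 2 * pad_h, w + 2 * pad_w), mode="bicubic", align_corners=False
    )
    # 2. crop away the padding
    depth = depth[..., pad_h : depth.shape[-2] - pad_h, pad_w : depth.shape[-1] - pad_w]
    # 3. (optionally) resize to any target size here with another F.interpolate call
    return depth[0, 0]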
Which you can double-check using the testing code below:
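(A sketch of such a check, reusing the unpad_depth helper sketched above; again, not the exact testing code from this issue.)

from PIL import Image
import numpy as np
import requests
import torch
from transformers import AutoImageProcessor, ZoeDepthForDepthEstimation

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu")
model = ZoeDepthForDepthEstimation.from_pretrained("Intel/zoedepth-nyu")

url = "https://www.greece-is.com/wp-content/uploads/2016/07/ATH_RIVIERA_naos-poseidona-sounio-01.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth[0]

depth = unpad_depth(predicted_depth, source_size=image.size[::-1])
print("Input size (h, w): ", image.size[::-1])
print("Output size (h, w):", tuple(depth.shape))

depth_np = depth.numpy()
Image.fromarray((depth_np * 255 / np.max(depth_np)).astype("uint8")).show()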
The test code should output the following image:
Default post_process_depth_estimation
For the sake of completeness, I also share the default post_process_depth_estimation function for the rest of the models that do not have padded outputs:
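(Such a default version essentially just resizes the raw prediction to the target size. A rough sketch, not the exact function shared in this issue:)

from typing import Dict, List, Optional, Tuple

import numpy as np
import torch
from PIL import Image
from torch.nn import functional as F

def post_process_depth_estimation(outputs, target_sizes: Optional[List[Tuple[int, int]]] = None) -> List[Dict]:
    # Default-style post-processing for depth models whose outputs are NOT padded:
    # resize each raw prediction to the requested (height, width) and build a depth PIL image
    results = []
    for i, depth in enumerate(outputs.predicted_depth):
        if target_sizes is not None:
            depth = F.interpolate(
                depth[None, None], size=target_sizes[i], mode="bicubic", align_corners=False
            )[0, 0]
        depth_np = depth.cpu().numpy()
        pil = Image.fromarray((depth_np * 255 / np.max(depth_np)).astype("uint8"))
        results.append({"predicted_depth": depth_np, "depth": pil})
    return results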
Expected behavior
To be coherent with the rest of the depth estimation models, the ideal scenario would be for ZoeDepth to directly output a cropped prediction with the padding removed. However, I understand that this is not very easy, considering that an additional input for source_size would probably need to be added in this case.

Comparison with official source implementation
Still, when comparing the outputs of the official implementation with the HF model plus the post-processing above, there is a small discrepancy, and I am not yet sure what to blame. To use the official implementation, run the code below:
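(Running the original implementation through torch hub looks roughly like the following; a sketch based on the isl-org/ZoeDepth README, where the ZoeD_N entry point corresponds to the NYU checkpoint.)

import requests
import torch
from PIL import Image

# Load the original ZoeDepth NYU model from torch hub
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
zoe = zoe.to("cpu").eval()

url = "https://www.greece-is.com/wp-content/uploads/2016/07/ATH_RIVIERA_naos-poseidona-sounio-01.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# infer_pil handles the padding, flip-averaging and resizing internally and returns a numpy depth map
depth_original = zoe.infer_pil(image)
print("Original implementation depth shape (h, w):", depth_original.shape)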
The output image looks very much like the one before:
However, when comparing the outputs here with the outputs before, we get:
Conclusion - TL;DR
- Currently, the output from Transformers for this model includes padding. This is not mentioned anywhere in the documentation, and none of the other depth models do this.
- Above, I share a post_processing_depth_estimation function that removes the padding, resizes the model output to match the input image, and additionally returns a depth image.
- Ideally, to be coherent with the rest of the models and avoid confusion, the output of the ZoeDepth model should already be cropped to remove the padding; however, this is not very easy considering its inputs (an input for source_size should be added in this case).
- Still, after using the function above, there is a discrepancy between the output I get using HF vs. the original implementation.