huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ZoeDepth outputs include padding --> not referenced in the docs & solution is not obvious #32381

Closed alex-bene closed 3 weeks ago

alex-bene commented 3 months ago

System Info

transformers==4.43.0

Who can help?

@amyeroberts @stevhliu

Reproduction

Hello everyone,

I stumbled across #30917 while trying to figure out what was going on with the output of the ZoeDepth model. However, since that issue is quite a bit more general (it is about depth estimators as a whole), I am opening this issue specifically about the outputs of ZoeDepth.

Explaining/Showing the problem

For this model, the ImageProcessor adds reflection padding around the input images to fix the boundary artifacts in the output depth map. As a result, the depth predictions and the image output by the model include this padding, in contrast to every other depth model here.

This can be easily seen by running the code below:

from transformers import AutoImageProcessor
from PIL import Image
import requests

url = "https://www.greece-is.com/wp-content/uploads/2016/07/ATH_RIVIERA_naos-poseidona-sounio-01.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu")
# Preprocess without normalization so the resulting tensor can be visualized directly
img = image_processor(
    images=image, return_tensors="pt", do_normalize=False
)["pixel_values"].squeeze().permute(1, 2, 0).cpu().numpy()
# Rescale to [0, 255] and display; the reflection padding is visible around the borders
pil = Image.fromarray((img * 255 / img.max()).astype("uint8"))
pil.thumbnail((512, 512))
pil.show()

which produces the following image: [image]

For reference, here's the same image and the same code, but with do_pad=False inside the image_processor call: [image]
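Concretely, the only difference from the snippet above is the extra do_pad=False argument in the preprocessing call:

img = image_processor(
    images=image, return_tensors="pt", do_normalize=False, do_pad=False
)["pixel_values"].squeeze().permute(1, 2, 0).cpu().numpy()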

Solution

[!IMPORTANT]
While the issue is still open, please add a note in the docs about this discrepancy between ZoeDepth and the rest of the depth prediction models; it took me more time than it should have to understand what's going on. @stevhliu

The notebook linked by @NielsRogge probably has some code showing how the padding is added and removed; however, since I do not have access to it, I'll share here the various info I have gathered, as well as a post-processing function roughly following the style of post_process_object_detection.

Finding out exactly what's going on under the hood

As you can see in the original repo as well as in the ZoeDepthImageProcessor, before inference the images are padded in both dimensions by:

pad_h = int(np.sqrt(img_height/2) * fh) # height padding
pad_w = int(np.sqrt(img_width/2) * fw) # width padding

where fh and fw are equal to 3 by default. The padded images are then resized and fed into the model.
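As a quick numeric sketch (the 480x640 input size here is just a made-up example):

import numpy as np

img_height, img_width = 480, 640  # hypothetical input size
fh = fw = 3                       # default padding factors

pad_h = int(np.sqrt(img_height / 2) * fh)  # int(15.49 * 3) = 46 pixels added on top and bottom
pad_w = int(np.sqrt(img_width / 2) * fw)   # int(17.89 * 3) = 53 pixels added on the left and right
print(pad_h, pad_w)  # 46 53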

Thus, to get the final depth predictions and image corresponding to the input image, you need to:

  1. Resize the output to the size of the input image PLUS the padding
  2. Remove the padding

ZoeDepth post_process_depth_estimation

from typing import Union, List, Tuple, Dict
from PIL import Image

import numpy as np
import torch
from torch.nn import functional as F
from transformers.models.zoedepth.modeling_zoedepth import ZoeDepthDepthEstimatorOutput

def post_process_depth_estimation_zoedepth(
    outputs: ZoeDepthDepthEstimatorOutput,
    source_sizes: Union[torch.Tensor, List[Tuple[int, int]]],
    target_sizes: Union[torch.Tensor, List[Tuple[int, int]]] = None,
    remove_padding: bool = True,
) -> List[Dict]:
    """
    Converts the raw output of [`ZoeDepthDepthEstimatorOutput`] into final depth predictions and depth PIL image.
    Only supports PyTorch.

    Args:
        outputs ([`ZoeDepthDepthEstimatorOutput`]):
            Raw outputs of the model.
        source_sizes (`torch.Tensor` or `List[Tuple[int, int]]`):
            Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the source size
            (height, width) of each image in the batch before preprocessing.
        target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*):
            Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
            (height, width) of each image in the batch. If left to None, predictions will not be resized.
        remove_padding (`bool`):
            By default, ZoeDepth adds padding to fix the boundary artifacts in the output depth map, so we need
            to remove this padding during post-processing. The parameter exists here in case the user changed
            the image preprocessing to not include padding.

    Returns:
        `List[Dict]`: A list of dictionaries, each dictionary containing the depth predictions and a depth PIL
        image as predicted by the model.
    """
    predicted_depth = outputs.predicted_depth

    if (target_sizes is not None) and (len(predicted_depth) != len(target_sizes)):
        raise ValueError(
            "Make sure that you pass in as many target sizes as the batch dimension of the predicted depth"
        )

    if (source_sizes is None) or (len(predicted_depth) != len(source_sizes)):
        raise ValueError(
            "Make sure that you pass in as many source image sizes as the batch dimension of the logits"
        )

    # Zoe Depth model adds padding around the images to fix the boundary artifacts in the output depth map
    # The padding length is `int(np.sqrt(img_h/2) * fh)` for the height and similar for the width
    # fh (and fw respectively) are equal to '3' by default
    # Check [here](https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/models/depth_model.py#L57)
    # for the original implementation.
    # In this section, we remove this padding to get the final depth image and depth prediction
    if isinstance(source_sizes, List):
        img_h = torch.Tensor([i[0] for i in source_sizes])
        img_w = torch.Tensor([i[1] for i in source_sizes])
    else:
        img_h, img_w = source_sizes.unbind(1)

    fh = fw = 3

    results = []
    for i, (d, s) in enumerate(zip(predicted_depth, source_sizes)):
        if remove_padding:
            pad_h = int(np.sqrt(s[0]/2) * fh)
            pad_w = int(np.sqrt(s[1]/2) * fw)
            d = F.interpolate(
                d.unsqueeze(0).unsqueeze(1), size=[s[0] + 2*pad_h, s[1] + 2*pad_w],
                mode="bicubic", align_corners=False
            )

            if pad_h > 0:
                d = d[:, :, pad_h:-pad_h, :]
            if pad_w > 0:
                d = d[:, :, :, pad_w:-pad_w]

        if target_sizes is not None:
            target_size = target_sizes[i]
            d = F.interpolate(d, size=target_size, mode="bicubic", align_corners=False)

        d = d.squeeze().cpu().numpy()
        pil = Image.fromarray((d * 255 / np.max(d)).astype("uint8"))
        results.append({"predicted_depth": d, "depth": pil})

    return results

You can double-check this using the test code below:

from PIL import Image
import requests

import torch
import numpy as np
from transformers import AutoImageProcessor, ZoeDepthForDepthEstimation

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu")
model = ZoeDepthForDepthEstimation.from_pretrained("Intel/zoedepth-nyu")

# prepare image for the model

url = "https://www.greece-is.com/wp-content/uploads/2016/07/ATH_RIVIERA_naos-poseidona-sounio-01.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

processed_output = post_process_depth_estimation_zoedepth(outputs, [image.size[::-1]])[0]
print("Input image size (h, w):", image.size[::-1])
print("Output predicted depth shape (h, w):", processed_output["predicted_depth"].shape)
print("Output depth image size (h, w):", processed_output["depth"].size[::-1])
processed_output["depth"].show()

The test code should output the following image: [image]

Default post_process_depth_estimation

For the sake of completeness, I also share the default post_process_depth_estimation function for the rest of the models that do not have padded outputs:

from typing import Union, List, Tuple, Dict
from PIL import Image

import numpy as np
import torch
from torch.nn import functional as F

def post_process_depth_estimation_default(
    outputs, target_sizes: Union[torch.Tensor, List[Tuple[int, int]]] = None
) -> List[Dict]:
    """
    Converts the raw output of [`*DepthEstimatorOutput`] into final depth predictions and depth PIL image.
    Only supports PyTorch.

    Args:
        outputs ([`*DepthEstimatorOutput`]):
            Raw outputs of the model.
        target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*):
            Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
            (height, width) of each image in the batch. If left to None, predictions will not be resized.

    Returns:
        `List[Dict]`: A list of dictionaries, each dictionary containing the depth predictions and a depth PIL
        image as predicted by the model.
    """
    predicted_depth = outputs.predicted_depth

    if (target_sizes is not None) and (len(predicted_depth) != len(target_sizes)):
        raise ValueError(
            "Make sure that you pass in as many target sizes as the batch dimension of the predicted depth"
        )

    results = []
    for i, d in enumerate(predicted_depth):
        if target_sizes is not None:
            target_size = target_sizes[i]
            # `d` is (height, width); `F.interpolate` with bicubic mode expects a 4D (batch, channels, height, width) input
            d = F.interpolate(
                d.unsqueeze(0).unsqueeze(1), size=target_size, mode="bicubic", align_corners=False
            )
        d = d.squeeze().cpu().numpy()
        pil = Image.fromarray((d * 255 / np.max(d)).astype("uint8"))

        results.append({"predicted_depth": d, "depth": pil})

    return results

Expected behavior

To be consistent with the rest of the depth estimation models, the ideal scenario would be for ZoeDepth to directly output a cropped prediction with the padding already removed. However, I understand that this is not very easy, considering that an additional source_size input would probably need to be added in this case.

Comparison with official source implementation

To run the code in this section, you'll need timm==0.6.11
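For example, with pip:

pip install timm==0.6.11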

Still, when comparing the outputs of the official implementation with the HF model plus the post-processing above, there is a small discrepancy, and I am not yet sure which side to blame. To use the official implementation, run the code below:

from PIL import Image

import torch

# `image` is the same PIL image loaded in the earlier snippets
model = torch.hub.load('isl-org/ZoeDepth', "ZoeD_N", pretrained=True).eval()
orig_output = model.infer_pil(image, output_type="numpy")  # includes the repo's own pre/post-processing
orig_pil = Image.fromarray((orig_output * 255 / orig_output.max()).astype("uint8"))

print("Input image size (h, w):", image.size[::-1])
print("Output predicted depth shape (h, w):", orig_output.shape)
print("Output depth image size (h, w):", orig_pil.size[::-1])
orig_pil.show()

The output image looks very much like the one before: [image]

However, when comparing these outputs with the ones from before

import numpy as np

error = orig_output - processed_output["predicted_depth"]

print("Max error:", error.max())
print("Min error:", error.min())
print("MSE:", (error**2).mean())
print("RMSE:", np.sqrt((error**2).mean()))
print("MAE:", (np.abs(error).mean()))

Image.fromarray((error * 255 / error.max()).astype("uint8")).show()

we get

Max error: 0.7920196
Min error: -0.21409726
MSE: 0.0025115514
RMSE: 0.05011538
MAE: 0.033222165

[image]

Conclusion - TL;DR

  1. Currently, the output from Transformers for this model also includes padding. This is very weird since it is not mentioned anywhere in the documentation while, at the same time, all the other depth models do not do this.
  2. Here, I share a post_processing_depth_estimation function that removes the padding, resizes the model output to match the input image, and additionally returns a depth image.
  3. In an ideal world, to be coherent with the rest of the models and avoid confusion, I believe that the output of the ZoeDepth model should somehow be already cropped to remove the padding; however this is not very easy considering its inputs (an input for source_size should be added in this case).
  4. Still, after using the function above, there is a discrepancy between the output I get using HF vs. using the original implementation.

amyeroberts commented 3 months ago

Hi @alex-bene,

Thanks for opening this issue and writing up such a detailed report - it's greatly appreciated.

Yes, indeed, there should be proper processing for the model's outputs (in fact, this was what triggered the issue).

With regards to the specific points in the TLDR:

> Currently, the output from Transformers for this model also includes padding. This is very weird since it is not mentioned anywhere in the documentation while at the same time, all the other depth models do not do this.

> Here, I share a post_processing_depth_estimation function that removes the padding, resizes the model output to match the input image, and additionally returns a depth image.

> In an ideal world, to be coherent with the rest of the models and avoid confusion, I believe that the output of the ZoeDepth model should somehow be already cropped to remove the padding; however this is not very easy considering its inputs (an input for source_size should be added in this case).

> Still, after using the function above, there is a discrepancy between the output I get using HF vs using the original implementation

There could be a multitude of things happening here, as the call to the torch hub model includes all pre- and post-processing as well as the model's forward pass. cc @NielsRogge regarding differences observed when porting.

NielsRogge commented 3 months ago

Hi,

Thanks for the detailed report. I ported the model by performing inference on the original repository and then making sure that both the preprocessing and a forward pass match the HF implementation. The script I used on the original implementation can be found here: https://github.com/isl-org/ZoeDepth/compare/main...NielsRogge:ZoeDepth:understanding_zoedepth?expand=1 (I ran the inference.py script). The logits and preprocessing are verified in the conversion script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/zoedepth/convert_zoedepth_to_hf.py

> The notebook linked by @NielsRogge probably has some code showing how the padding is added and removed; however, since I do not have access to it, I'll share here the various info I have gathered, as well as a post-processing function roughly following the style of post_process_object_detection.

The notebook is available here: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ZoeDepth/Inference_with_ZoeDepth.ipynb.

alex-bene commented 3 months ago

Hey @amyeroberts and thanks for the quick response,

> Very nice! Would you like to open a PR to add this? This way you get the GitHub contribution for the work you've done.

I'll probably get to it starting tomorrow, thanks.

> It's not uncommon to pad images before passing them to the model such that they can be batched. I'm surprised this is being applied even for a single image cc @NielsRogge to confirm the intended behaviour here.

Indeed; however, the padding here is not there so that the inputs can be batched, but rather to fix the boundary artifacts in the output depth map. Additionally, it is dynamic padding (based on the size of the input image), which makes removing it more complicated (i.e., I had to search the source code of the ImageProcessor and the original implementation to check how much padding was added and at what stage). Still, since none of the other depth models output a padded image, and neither does the original implementation (it has the post-processing step integrated), I believe a small note in the docs would help a lot of future devs trying to use the model.

@NielsRogge I looked over what you sent and indeed it seems to suggest a match between the HF implementation and the original repo. Maybe the discrepancy in my code has something to do with the padding removal (?). I will do a more in-depth check and get back to you.

alex-bene commented 3 months ago

Hey, so I revisited this today.

It turns out that in the original repo, they run inference on both the image and a horizontally flipped copy of it, and then average the two results, as can be seen here.
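Conceptually, this test-time flip averaging boils down to the following (a minimal sketch, where predict is a stand-in for a forward pass mapping pixel_values of shape (batch, channels, height, width) to a depth map of shape (batch, height, width)):

import torch

def predict_with_flip_averaging(predict, pixel_values):
    # Run the model on the original and on the horizontally flipped input,
    # flip the second prediction back, and average the two depth maps.
    depth = predict(pixel_values)
    depth_flip = predict(torch.flip(pixel_values, dims=[3]))
    return (depth + torch.flip(depth_flip, dims=[-1])) / 2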

Inconsistency fix between Transformers and original repo

So, to have a 1-1 mapping with the original implementation we need to do the following:

from PIL import Image
import requests

import torch
import numpy as np
from transformers import AutoImageProcessor, ZoeDepthForDepthEstimation

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu")
model = ZoeDepthForDepthEstimation.from_pretrained("Intel/zoedepth-nyu")

# prepare image for the model

url = "https://www.greece-is.com/wp-content/uploads/2016/07/ATH_RIVIERA_naos-poseidona-sounio-01.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=[image, image], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    outputs_flip = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))

processed_output = post_process_depth_estimation_zoedepth(outputs, [image.size[::-1]]*2, outputs_flip=outputs_flip)[0]

print("Input image size (h, w):", image.size[::-1])
print("Output predicted depth shape (h, w):", processed_output["predicted_depth"].shape)
print("Output depth image size (h, w):", processed_output["depth"].size[::-1])
processed_output["depth"].show()

Where the post_process_depth_estimation_zoedepth function is updated to accept the flipped outputs:

from typing import Union, List, Tuple, Dict, Optional
from PIL import Image

import numpy as np
import torch
from torch.nn import functional as F
from transformers.models.zoedepth.modeling_zoedepth import ZoeDepthDepthEstimatorOutput

def post_process_depth_estimation_zoedepth(
    outputs: ZoeDepthDepthEstimatorOutput,
    source_sizes: Union[torch.Tensor, List[Tuple[int, int]]],
    target_sizes: Union[torch.Tensor, List[Tuple[int, int]]] = None,
    outputs_flip: Optional[ZoeDepthDepthEstimatorOutput] = None,
    remove_padding: bool = True,
) -> List[Dict]:
    """
    Converts the raw output of [`ZoeDepthDepthEstimatorOutput`] into final depth predictions and depth PIL image.
    Only supports PyTorch.

    Args:
        outputs ([`ZoeDepthDepthEstimatorOutput`]):
            Raw outputs of the model.
        outputs_flip ([`ZoeDepthDepthEstimatorOutput`], *optional*):
            Raw outputs of the model from flipped input (averaged out in the end).
        source_sizes (`torch.Tensor` or `List[Tuple[int, int]]`):
            Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the source size
            (height, width) of each image in the batch before preprocessing.
        target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*):
            Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
            (height, width) of each image in the batch. If left to None, predictions will not be resized.
        remove_padding (`bool`):
            By default, ZoeDepth adds padding to fix the boundary artifacts in the output depth map, so we need
            to remove this padding during post-processing. The parameter exists here in case the user changed
            the image preprocessing to not include padding.

    Returns:
        `List[Dict]`: A list of dictionaries, each dictionary containing the depth predictions and a depth PIL
        image as predicted by the model.
    """
    predicted_depth = outputs.predicted_depth

    if (outputs_flip is not None) and (predicted_depth.shape != outputs_flip.predicted_depth.shape):
        raise ValueError(
            "Make sure that `outputs` and `outputs_flip` have the same shape"
        )

    if (target_sizes is not None) and (len(predicted_depth) != len(target_sizes)):
        raise ValueError(
            "Make sure that you pass in as many target sizes as the batch dimension of the predicted depth"
        )

    if (source_sizes is None) or (len(predicted_depth) != len(source_sizes)):
        raise ValueError(
            "Make sure that you pass in as many source image sizes as the batch dimension of the logits"
        )

    if outputs_flip is not None:
        predicted_depth = torch.stack([predicted_depth, outputs_flip.predicted_depth], dim=1)
    else:
        predicted_depth = predicted_depth.unsqueeze(1)

    # Zoe Depth model adds padding around the images to fix the boundary artifacts in the output depth map
    # The padding length is `int(np.sqrt(img_h/2) * fh)` for the height and similar for the width
    # fh (and fw respectively) are equal to '3' by default
    # Check [here](https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/models/depth_model.py#L57)
    # for the original implementation.
    # In this section, we remove this padding to get the final depth image and depth prediction
    if isinstance(source_sizes, List):
        img_h = torch.Tensor([i[0] for i in source_sizes])
        img_w = torch.Tensor([i[1] for i in source_sizes])
    else:
        img_h, img_w = source_sizes.unbind(1)

    fh = fw = 3

    results = []
    for i, (d, s) in enumerate(zip(predicted_depth, source_sizes)):
        if remove_padding:
            pad_h = int(np.sqrt(s[0]/2) * fh)
            pad_w = int(np.sqrt(s[1]/2) * fw)
            d = F.interpolate(
                d.unsqueeze(1), size=[s[0] + 2*pad_h, s[1] + 2*pad_w],
                mode="bicubic", align_corners=False
            )

            if pad_h > 0:
                d = d[:, :, pad_h:-pad_h, :]
            if pad_w > 0:
                d = d[:, :, :, pad_w:-pad_w]

        if target_sizes is not None:
            target_size = target_sizes[i]
            d = F.interpolate(d, size=target_size, mode="bicubic", align_corners=False)

        if outputs_flip is not None:
            d, d_f = d.chunk(2)
            d = (d + torch.flip(d_f, dims=[-1])) / 2

        d = d.squeeze().cpu().numpy()
        pil = Image.fromarray((d * 255 / np.max(d)).astype("uint8"))
        results.append({"predicted_depth": d, "depth": pil})

    return results

With these changes, when we compare with the original implementation we get:

Max error: 6.7949295e-05
Min error: -4.7922134e-05
MSE: 2.5678112e-11
RMSE: 5.0673575e-06
MAE: 3.975931e-06

[image]

So this is fixed.

Incorporation of do_flip inside the image_processor

However, for the PR, to better incorporate this flipped input/output with the model, I was wondering whether it would be better to add a do_flip input to the image processor (similar to the existing do_pad), or to leave it to each user to run the flipped input through the model and then pass it to the post-processor. I think I prefer incorporating the functionality into the image_processor, but I wanted to get your opinion @NielsRogge.
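For illustration, this is roughly how the user-facing side could look if a do_flip option were added (purely hypothetical; this argument does not exist today):

# Hypothetical `do_flip` argument -- NOT part of the current ZoeDepthImageProcessor API
inputs = image_processor(images=image, return_tensors="pt", do_flip=True)
# The processor would then also return the horizontally flipped pixel values, and the
# post-processor would average the predictions of the original and flipped inputs.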

ArthurZucker commented 2 months ago

If the image is originally flipped, IMO it makes sense to use do_flip

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.