huggingface / transformers


DPT normalization causes contouring when there are significant disparities in depth values between adjacent areas #28292

Open CyrusVorwald opened 10 months ago

CyrusVorwald commented 10 months ago

System Info

Python 3.10.12, transformers 4.36.2

Who can help?

@stevhliu @NielsRogge

Reproduction

from transformers import DPTImageProcessor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "https://images.unsplash.com/photo-1605146768851-eda79da39897?q=80&w=2970&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

# prepare image for the model
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
display(depth)

Some weights of DPTForDepthEstimation were not initialized from the model checkpoint at Intel/dpt-large and are newly initialized: ['neck.fusion_stage.layers.0.residual_layer1.convolution2.bias', 'neck.fusion_stage.layers.0.residual_layer1.convolution1.weight', 'neck.fusion_stage.layers.0.residual_layer1.convolution2.weight', 'neck.fusion_stage.layers.0.residual_layer1.convolution1.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[Image: edge_effects_depth — resulting depth map with visible contouring along boundaries with large depth disparities]

Expected behavior

Anecdotally, the local scaling methodology used by get_depth_map at https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0 (per-image min-max normalization) seems to work better for models that are stronger at close-range depth, while the global scaling methodology (dividing by the maximum value) seems to work better for models that are stronger at far-range depth. I combined the two below:

def get_depth_map(image, feature_extractor, depth_estimator, scale_local):
    inputs = feature_extractor(images=image, return_tensors="pt").pixel_values.to("cuda")
    with torch.no_grad(), torch.autocast("cuda"):
        depth_map = depth_estimator(inputs).predicted_depth

    depth_map = torch.nn.functional.interpolate(
        depth_map.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )

    if scale_local:
        depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True)
        depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True)
        depth_map = (depth_map - depth_min) / (depth_max - depth_min)
        image = torch.cat([depth_map] * 3, dim=1)

        image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
        image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8))
        return image

    output = depth_map.squeeze().cpu().numpy()
    formatted = (output * 255 / np.max(output)).astype("uint8")
    return Image.fromarray(formatted)

from transformers import AutoImageProcessor

depth_estimator_hybrid = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to("cuda")
depth_estimator_dinov2_nyu = DPTForDepthEstimation.from_pretrained("facebook/dpt-dinov2-giant-nyu").to("cuda")

image_processor_hybrid = AutoImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")
image_processor_dinov2_nyu = AutoImageProcessor.from_pretrained("facebook/dpt-dinov2-giant-nyu")

# Close range depth
bad_close_result = get_depth_map(image, image_processor_hybrid, depth_estimator_hybrid, False)
good_close_result = get_depth_map(image, image_processor_hybrid, depth_estimator_hybrid, True)

# Far range depth
downscaled_image = image.resize((1024, 1024))  # This image is too big for my GPU to process with dpt-dinov2-giant-nyu, so I downscaled it
good_far_result = get_depth_map(downscaled_image, image_processor_dinov2_nyu, depth_estimator_dinov2_nyu, False)
bad_far_result = get_depth_map(downscaled_image, image_processor_dinov2_nyu, depth_estimator_dinov2_nyu, True)

display(bad_close_result)
[Image: globally_scaled_depth_close]

display(good_close_result)
[Image: locally_scaled_depth_close]

display(good_far_result)
[Image: globally_scaled_depth_far]

display(bad_far_result)
[Image: locally_scaled_depth_far]

Sufficiently blurring the image before running depth estimation also removes the contouring, e.g.:

from PIL import ImageFilter

blurred_image = image.filter(ImageFilter.GaussianBlur(radius=5))
display(get_depth_map(blurred_image, image_processor_hybrid, depth_estimator_hybrid, False))

[Image: blurred_depth — depth map from the blurred input, without contouring]

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

amyeroberts commented 8 months ago

Hi @CyrusVorwald, thanks for opening this issue!

get_depth_map isn't defined in the transformers library, and so it's not something we can work on. I'd suggest opening a discussion on the model page and sharing these results.

@NielsRogge Could you look into some weights being randomly initialized when loading from this checkpoint?

NielsRogge commented 8 months ago

I think the warning about some weights not being initialized started appearing after @younesbelkada added support for DPT-hybrid to the modeling_dpt.py code. The hybrid version of DPT introduced some additional parameters which aren't used by the default DPT model.
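
A quick way to inspect exactly which parameters trigger the warning (a minimal sketch, not from the original report, using from_pretrained's output_loading_info flag):

from transformers import DPTForDepthEstimation

model, loading_info = DPTForDepthEstimation.from_pretrained(
    "Intel/dpt-large", output_loading_info=True
)
# parameters defined by the model but absent from the checkpoint (randomly initialized)
print(loading_info["missing_keys"])
# checkpoint weights that the model did not use
print(loading_info["unexpected_keys"])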

amyeroberts commented 8 months ago

@NielsRogge I'm not sure that's correct. The warning says that model parameters are being randomly initialized, i.e. the model defines those parameters but they're not present in the state dict being loaded. Moreover, the weights belong to neck.fusion_stage.layers.0, which according to git blame was added as part of the original model.
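
A rough way to double-check this against the checkpoint itself (a sketch that assumes the Intel/dpt-large repo hosts a pytorch_model.bin file; adjust the filename if it only ships safetensors):

import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download("Intel/dpt-large", "pytorch_model.bin")
state_dict = torch.load(ckpt_path, map_location="cpu")
# keys containing the flagged submodule; an empty list means the checkpoint really lacks them
print([k for k in state_dict if "fusion_stage.layers.0.residual_layer1" in k])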

amyeroberts commented 7 months ago

Gentle ping @NielsRogge

NielsRogge commented 7 months ago

Thanks for the ping; this has to do with the following:

TL;DR: this is fine, all weights are used; however, the implementation could be improved to avoid the warning about weights being randomly initialized. Marking this as a good second issue for now.
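
A hypothetical sketch of one possible direction, assuming the flagged parameters are indeed never exercised in the forward pass of the first fusion layer (the class names below are made up for illustration and this is not the actual fix in modeling_dpt.py):

import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    # stand-in for the residual conv block used in the DPT fusion stage
    def __init__(self, hidden_size):
        super().__init__()
        self.activation = nn.ReLU()
        self.convolution1 = nn.Conv2d(hidden_size, hidden_size, kernel_size=3, padding=1)
        self.convolution2 = nn.Conv2d(hidden_size, hidden_size, kernel_size=3, padding=1)

    def forward(self, hidden_state):
        out = self.convolution1(self.activation(hidden_state))
        out = self.convolution2(self.activation(out))
        return hidden_state + out

class FusionLayerSketch(nn.Module):
    def __init__(self, hidden_size, use_residual_layer1=True):
        super().__init__()
        # the first fusion layer never receives output from a previous fusion layer,
        # so residual_layer1 would be unused there; skip creating it in that case
        # and it never shows up in the state dict (no "newly initialized" warning)
        self.residual_layer1 = ResidualBlockSketch(hidden_size) if use_residual_layer1 else None
        self.residual_layer2 = ResidualBlockSketch(hidden_size)

    def forward(self, hidden_state, residual=None):
        if residual is not None and self.residual_layer1 is not None:
            hidden_state = hidden_state + self.residual_layer1(residual)
        return self.residual_layer2(hidden_state)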