huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

DPT normalization causes contouring when there are significant disparities in depth values between adjacent areas #28292

Open CyrusVorwald opened 11 months ago

CyrusVorwald commented 11 months ago

System Info

Python 3.10.12, transformers 4.36.2

Who can help?

@stevhliu @NielsRogge

Reproduction

from transformers import DPTImageProcessor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "https://images.unsplash.com/photo-1605146768851-eda79da39897?q=80&w=2970&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

# prepare image for the model
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
display(depth)

Some weights of DPTForDepthEstimation were not initialized from the model checkpoint at Intel/dpt-large and are newly initialized: ['neck.fusion_stage.layers.0.residual_layer1.convolution2.bias', 'neck.fusion_stage.layers.0.residual_layer1.convolution1.weight', 'neck.fusion_stage.layers.0.residual_layer1.convolution2.weight', 'neck.fusion_stage.layers.0.residual_layer1.convolution1.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[Attached image: edge_effects_depth]

Expected behavior

Anecdotally, the local scaling methodology used by get_depth_map at https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0 seems to work better with models that are stronger at close-range depth, while the global scaling methodology seems to work better with models that are stronger at far-range depth. I combined them below:

def get_depth_map(image, feature_extractor, depth_estimator, scale_local):
    inputs = feature_extractor(images=image, return_tensors="pt").pixel_values.to("cuda")
    with torch.no_grad(), torch.autocast("cuda"):
        depth_map = depth_estimator(inputs).predicted_depth

    depth_map = torch.nn.functional.interpolate(
        depth_map.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )

    if scale_local:
        # Local scaling: per-image min-max normalization to [0, 1]
        depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True)
        depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True)
        depth_map = (depth_map - depth_min) / (depth_max - depth_min)
        image = torch.cat([depth_map] * 3, dim=1)

        image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
        image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8))
        return image

    # Global scaling: divide by the maximum only, without subtracting the minimum
    output = depth_map.squeeze().cpu().numpy()
    formatted = (output * 255 / np.max(output)).astype("uint8")
    return Image.fromarray(formatted)

from transformers import AutoImageProcessor

depth_estimator_hybrid = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to("cuda")
depth_estimator_dinov2_nyu = DPTForDepthEstimation.from_pretrained("facebook/dpt-dinov2-giant-nyu").to("cuda")

image_processor_hybrid = AutoImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")
image_processor_dinov2_nyu = AutoImageProcessor.from_pretrained("facebook/dpt-dinov2-giant-nyu")

# Close range depth
bad_close_result = get_depth_map(image, image_processor_hybrid, depth_estimator_hybrid, False)
good_close_result = get_depth_map(image, image_processor_hybrid, depth_estimator_hybrid, True)

# Far range depth
downscaled_image = image.resize((1024, 1024))  # The original image is too big for my GPU to process with dpt-dinov2-giant-nyu, so I downscaled it
good_far_result = get_depth_map(downscaled_image, image_processor_dinov2_nyu, depth_estimator_dinov2_nyu, False)
bad_far_result = get_depth_map(downscaled_image, image_processor_dinov2_nyu, depth_estimator_dinov2_nyu, True)

display(bad_close_result)   # attached image: globally_scaled_depth_close
display(good_close_result)  # attached image: locally_scaled_depth_close
display(good_far_result)    # attached image: globally_scaled_depth_far
display(bad_far_result)     # attached image: locally_scaled_depth_far

Sufficiently blurring the image before estimating depth also gets rid of this, e.g.:

from PIL import ImageFilter

blurred_image = image.filter(ImageFilter.GaussianBlur(radius=5))
display(get_depth_map(blurred_image, image_processor_hybrid, depth_estimator_hybrid, False))

[Attached image: blurred_depth]

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

amyeroberts commented 9 months ago

Hi @CyrusVorwald, thanks for opening this issue!

get_depth_map isn't defined in the transformers library, and so it's not something we can work on. I'd suggest opening a discussion on the model page and sharing these results.

@NielsRogge Could you look into some weights being randomly initialized when loading from this checkpoint?

NielsRogge commented 9 months ago

I think the warning about some weights not being initialized started appearing after @younesbelkada added support for DPT-hybrid in the modeling_dpt.py code. The hybrid version of DPT introduced additional parameters that aren't used by the default DPT model.

amyeroberts commented 8 months ago

@NielsRogge I'm not sure that's correct. The warning says that model parameters are being randomly initialized, i.e. the model defines those parameters but they aren't present in the state dict being loaded. Moreover, the weights belong to neck.fusion_stage.layers.0, which according to git blame were layers added as part of the original model.
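
For anyone who wants to confirm which keys are actually missing from the checkpoint versus unused by the model, from_pretrained can report this directly. A minimal sketch, using the same Intel/dpt-large checkpoint as the reproduction above:

from transformers import DPTForDepthEstimation

# output_loading_info=True additionally returns the key lists the warning is derived from
model, loading_info = DPTForDepthEstimation.from_pretrained(
    "Intel/dpt-large", output_loading_info=True
)
print(loading_info["missing_keys"])     # parameters the model defines but the checkpoint lacks
print(loading_info["unexpected_keys"])  # checkpoint weights the model does not use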

amyeroberts commented 8 months ago

Gentle ping @NielsRogge

NielsRogge commented 8 months ago

Thanks for the ping. This has to do with the following:

TL;DR: this is fine and all weights are used, but the implementation could be improved so that the warning about weights being randomly initialized no longer appears. Marking this as a good second issue for now.
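
In the meantime, a blunt workaround for anyone who just wants the log line gone is to lower the transformers logging verbosity around the load. A small sketch (note this hides other warnings as well, so it is not a real fix):

from transformers import DPTForDepthEstimation
from transformers.utils import logging

logging.set_verbosity_error()    # suppresses warnings, including the "newly initialized" message
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
logging.set_verbosity_warning()  # restore the default verbosity afterwards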

psychedelicious commented 1 week ago

We use transformers' DepthEstimationPipeline and have run into what I believe is the same issue described here.

Is it possible to add a workaround without digging into the internals of transformers?
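
For reference, applying the per-image min-max ("local") scaling from earlier in this thread on top of the pipeline output might look roughly like the sketch below. This is not a verified fix; it assumes the pipeline's raw predicted_depth tensor is what needs rescaling, and Intel/dpt-hybrid-midas / input.jpg are only placeholders:

import numpy as np
import torch
from PIL import Image
from transformers import pipeline

pipe = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
image = Image.open("input.jpg")

result = pipe(image)
depth = result["predicted_depth"].float()  # raw depth values, before the pipeline's own scaling

# Bring the tensor to (1, 1, H, W) so it can be resized to the input resolution
while depth.ndim < 4:
    depth = depth.unsqueeze(0)
depth = torch.nn.functional.interpolate(
    depth, size=image.size[::-1], mode="bicubic", align_corners=False
)

# Per-image min-max scaling instead of dividing by the maximum only
depth = (depth - depth.amin()) / (depth.amax() - depth.amin())
depth_image = Image.fromarray((depth.squeeze().cpu().numpy() * 255).clip(0, 255).astype(np.uint8))
depth_image.save("depth.png")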

NielsRogge commented 1 week ago

Pinging @qubvel here who might have some insights

simonfuhrmann commented 5 days ago

@qubvel Can you please provide some insights here?

qubvel commented 5 days ago

Hey @simonfuhrmann @psychedelicious!

I suppose there was a similar discussion here.