heyoeyo / muggled_dpt

Muggled DPT: Depth estimation without the magic

Size of depth .npy file #5

Open plusgrey opened 1 month ago

plusgrey commented 1 month ago

Hi,

Thanks for your amazing work.

I have one question. I am trying the fusion scaling demo on a 640x480 image. However, the saved depth .npy file shows that its size is 504x364.

Is this because the DPT image_preprocessor resizes the image to fit the 14px patch size?

If so, how do we get a depth map that matches the original image size?

Thanks for your help in advance.

plusgrey commented 1 month ago

Moreover, when I load the weights of the metric_depth model from Depth Anything V2, the resulting depth image (with all fusion factors set to 1) is all black (all 0.0 in the saved numpy array). Do you have any idea why?

plusgrey commented 1 month ago

My test code is:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# ---------------------------------------------------------------------------------------------------------------------
# %% Imports

import os
import os.path as osp
import sys
import argparse
from time import perf_counter

import cv2
import numpy as np
import torch
import torch.nn.functional as F
# This is a hack to make this script work from inside the experiments folder!
# try:
#     import lib # NOQA
# except ModuleNotFoundError:
#     import sys
#     parent_folder = osp.dirname(osp.dirname(__file__))
#     print(parent_folder)
#     if "lib" in os.listdir(parent_folder): sys.path.insert(0, parent_folder)
#     else: raise ImportError("Can't find path to lib folder!")
sys.path.append('/home/user/Desktop/muggled_dpt')
from lib.make_dpt import make_dpt_from_state_dict

from lib.demo_helpers.history_keeper import HistoryKeeper
from lib.demo_helpers.loading import ask_for_path_if_missing, ask_for_model_path_if_missing
from lib.demo_helpers.ui import SliderCB, ColormapButtonsCB, ButtonBar, ScaleByKeypress
from lib.demo_helpers.visualization import DisplayWindow, histogram_equalization
from lib.demo_helpers.saving import save_image, save_numpy_array, save_uint16
from lib.demo_helpers.misc import (
    get_default_device_string, make_device_config, print_config_feedback, reduce_overthreading
)

# ---------------------------------------------------------------------------------------------------------------------
# %% Set up script args

# Set argparse defaults
default_device = get_default_device_string()
default_image_path = '/home/user/Desktop/Test/0000_color.png'
default_model_path = '/home/user/Desktop/Depth-Anything-V2/checkpoints/depth_anything_v2_metric_hypersim_vitl.pth'
#default_model_path = '/home/user/Desktop/Depth-Anything-V2/checkpoints/depth_anything_v2_vitb.pth'
default_display_size = 640
default_base_size = None

# Define script arguments
parser = argparse.ArgumentParser(description="Script used to run MiDaS DPT depth-estimation on a single image")
parser.add_argument("-i", "--image_path", default=default_image_path,
                    help="Path to image to run depth estimation on")
parser.add_argument("-m", "--model_path", default=default_model_path, type=str,
                    help="Path to DPT model weights")
parser.add_argument("-s", "--display_size", default=default_display_size, type=int,
                    help="Controls size of displayed results (default: {})".format(default_display_size))
parser.add_argument("-d", "--device", default=default_device, type=str,
                    help="Device to use when running model (ex: 'cpu', 'cuda', 'mps')")
parser.add_argument("-f32", "--use_float32", default=False, action="store_true",
                    help="Use 32-bit floating point model weights. Note: this doubles VRAM usage")
parser.add_argument("-ar", "--use_aspect_ratio", default=True, action="store_true",
                    help="Process the image at it's original aspect ratio, if the model supports it")
parser.add_argument("-b", "--base_size_px", default=default_base_size, type=int,
                    help="Override base (e.g. 384, 512) model size")

# For convenience
args = parser.parse_args()
arg_image_path = args.image_path
arg_model_path = args.model_path
display_size_px = args.display_size
device_str = args.device
use_float32 = args.use_float32
force_square_resolution = not args.use_aspect_ratio
model_base_size = args.base_size_px

# Hard-code no-cache usage, since there is no benefit if the model only runs once
use_cache = False

# Set up device config
device_config_dict = make_device_config(device_str, use_float32)

# Build pathing to repo-root, so we can search model weights properly
root_path = osp.dirname(osp.dirname(__file__))
save_folder = osp.join(root_path, "saved_images", "fusion_scaling")

# Create history to re-use selected inputs
history = HistoryKeeper(root_path)
_, history_imgpath = history.read("image_path")
_, history_modelpath = history.read("model_path")

# Get pathing to resources, if not provided already
image_path = ask_for_path_if_missing(arg_image_path, "image", history_imgpath)
model_path = ask_for_model_path_if_missing(root_path, arg_model_path, history_modelpath)

# Improve cpu utilization
reduce_overthreading(device_str)

# ---------------------------------------------------------------------------------------------------------------------
# %% Load resources

# Load model & image pre-processor
print("", "Loading model weights...", "  @ {}".format(model_path), sep="\n", flush=True)
model_config_dict, dpt_model, dpt_imgproc = make_dpt_from_state_dict(model_path, use_cache)
model_config_dict['max_depth'] = 20.0
if (model_base_size is not None):
    dpt_imgproc.set_base_size(model_base_size)

# Move model to selected device
dpt_model.to(**device_config_dict)
dpt_model.eval()

# Load image and apply preprocessing
orig_image_bgr = cv2.imread(image_path)
assert orig_image_bgr is not None, f"Error loading image: {image_path}"
img_tensor = dpt_imgproc.prepare_image_bgr(orig_image_bgr, force_square_resolution)
print_config_feedback(model_path, device_config_dict, use_cache, img_tensor)

# Prepare original image for display (and get sizing info)
scaled_input_img = dpt_imgproc.scale_to_max_side_length(orig_image_bgr, display_size_px)
disp_h, disp_w = scaled_input_img.shape[0:2]
disp_wh = (int(disp_w), int(disp_h))

# ---------------------------------------------------------------------------------------------------------------------
# %% Run model

t1 = perf_counter()

# Run model partially to get intermediate tokens for scaling
print("", "Computing reassembly results...", sep="\n", flush=True)
img_tensor = img_tensor.to(**device_config_dict)
with torch.inference_mode():
    patch_tokens, patch_grid_hw = dpt_model.patch_embed(img_tensor)
    imgenc_tokens = dpt_model.imgencoder(patch_tokens, patch_grid_hw)
    reasm_tokens = dpt_model.reassemble(*imgenc_tokens, patch_grid_hw)

t2 = perf_counter()
print("  -> Took", round(1000 * (t2 - t1), 1), "ms")

# ---------------------------------------------------------------------------------------------------------------------
# %% Display results

# Set up button controls

# Read controls
scale_factors = [1,1,1,1]

# Run remaining layers with scaling factors
with torch.inference_mode():
    # Run fusion steps manually, so we can apply scaling factors
    fuse_3 = dpt_model.fusion.blocks[3](reasm_tokens[3] * scale_factors[3])
    fuse_2 = dpt_model.fusion.blocks[2](reasm_tokens[2], fuse_3 * scale_factors[2])
    fuse_1 = dpt_model.fusion.blocks[1](reasm_tokens[1], fuse_2 * scale_factors[1])
    fuse_0 = dpt_model.fusion.blocks[0](reasm_tokens[0], fuse_1 * scale_factors[0])
    depth_prediction = dpt_model.head(fuse_0).squeeze(dim=1)
breakpoint()
# Post-processing for display
scaled_prediction = dpt_imgproc.scale_prediction(depth_prediction, disp_wh)
depth_norm = dpt_imgproc.normalize_01(scaled_prediction).float().cpu().numpy().squeeze()

# Produce colored depth image for display
depth_uint8 = np.uint8(np.round(255.0 * depth_norm))
# depth_color = cmap_btns.apply_colormap(depth_uint8)

# Apply modifications to raw prediction for saving
npy_prediction = (F.interpolate(depth_prediction[:, None], (480, 640), mode="bilinear", align_corners=True)).float().cpu().numpy().squeeze()
print(npy_prediction.shape)
print(np.unique(npy_prediction))
#npy_prediction = F.interpolate(npy_prediction[:, None], (480, 640), mode="bilinear", align_corners=True)[0, 0].float().cpu().numpy()

plusgrey commented 1 month ago

I found that this happens because of the weights of the second conv layer in dpt_head.proj_1ch: they map all values to negative numbers, and the final ReLU layer then clamps everything to 0. How can I fix this for metric depth estimation? It doesn't happen for relative depth estimation. @heyoeyo
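
A generic way to check this kind of thing (a sketch only, reusing dpt_model and fuse_0 from the script above, not anything built into the repo) is to hook the head's conv layers and print their output ranges, to see where the values go all-negative before the ReLU:

# Sketch: attach forward hooks to every conv layer in the head and report the
# min/max of each layer's output for a single forward pass
conv_ranges = {}

def make_range_hook(layer_name):
    def _hook(module, inputs, output):
        conv_ranges[layer_name] = (output.min().item(), output.max().item())
    return _hook

hook_handles = []
for layer_name, module in dpt_model.head.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        hook_handles.append(module.register_forward_hook(make_range_hook(layer_name)))

with torch.inference_mode():
    _ = dpt_model.head(fuse_0)

for handle in hook_handles:
    handle.remove()

for layer_name, (vmin, vmax) in conv_ranges.items():
    print(f"{layer_name}: min={vmin:.4f}, max={vmax:.4f}")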

heyoeyo commented 1 month ago

Hi @plusgrey thanks for checking out the repo!

If so, how do we get the depth matching the raw size.

It looks like you've already found a fix for this, using F.interpolate to scale up the depth prediction. The other thing you can try is running the model at a different size. It defaults to the 504 sizing, but that can be changed using the -b flag, or in your modified version, by setting default_base_size = 640 near the top of the file. This will actually run the model on a higher resolution copy of the image, and usually gives nicer results. The sizing is still limited to multiples of the 14px patch size though, so you'd still need F.interpolate to get the exact 640x480 size at the end.
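
For example, something along these lines (just a sketch reusing the names from the script above, reading the target size from the original image instead of hard-coding 480x640):

# Sketch: resize the raw prediction back to the original image resolution,
# whatever size the model actually ran at
orig_h, orig_w = orig_image_bgr.shape[0:2]
full_size_prediction = F.interpolate(
    depth_prediction[:, None],   # add channel dim -> (batch, 1, h, w)
    size=(orig_h, orig_w),
    mode="bilinear",
    align_corners=False,
).squeeze(1).float().cpu().numpy()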

when I load the weights of metric_depth model from depth anything v2

The metric-depth models aren't officially supported inside of muggled DPT, I'm actually a bit surprised it even loads!

final ReLU layer clamps them into 0

Looking more closely at the Depth-Anything-v2 metric-depth model, it doesn't seem to be using the normal 'zoedepth' structure that other DPT models use. Instead it nearly matches the relative depth model, except for that last ReLU layer which appears in the relative-depth models but is replaced by a sigmoid in the metric-depth models.

At some point I may have a closer look at the model weights to see if there is some way to auto-detect the metric model and switch the DPTHead implementation automatically... For now though, if you want to load/run the metric depth-anything 2 model, you could manually modify the muggledDPT head and replace that last ReLU layer with a sigmoid to get it working (this generates reasonable looking outputs inside the fusion script). The other thing that's different with the metric-depth model is that it scales that sigmoid output by a max_depth value (20.0 by default), which you'd need to replicate (though you can do this to the depth_prediction result in your script, rather than building it into the model). Also, if you're using the metric depth output, you'll want to remove the normalize_01 function call, since that will throw away most of the metric depth information.
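
A rough sketch of what that workaround could look like (this is not the repo's implementation; the search below assumes the final ReLU is the last one registered inside the head, which is worth double-checking against your copy of the code):

import torch.nn as nn

# Sketch: find the last ReLU registered in the head and swap it for a Sigmoid
last_parent, last_name = None, None
for parent in dpt_model.head.modules():
    for name, child in parent.named_children():
        if isinstance(child, nn.ReLU):
            last_parent, last_name = parent, name
if last_parent is not None:
    setattr(last_parent, last_name, nn.Sigmoid())

# Scale the sigmoid output by the model's max_depth (20.0 by default for the
# metric-depth checkpoints discussed here) instead of normalizing to 0-1
max_depth_m = 20.0
with torch.inference_mode():
    sigmoid_prediction = dpt_model.head(fuse_0).squeeze(dim=1)
metric_depth_m = (sigmoid_prediction * max_depth_m).float().cpu().numpy().squeeze()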

plusgrey commented 1 month ago

Hi, @heyoeyo. Thanks so much for your reply. I have successfully loaded and used the metric_depth model with your muggled_dpt. Although there are some slight deformations, the global metric depth is acceptable.