james77777778 opened this issue 2 years ago
Dear HongYu,
thank you for your interest!
Q: Could you share the training/evaluating code and pretrained weights about this work?
A: Thanks again for your interest; as this code originates from a commercial research entity, it contains non-shareable elements that would require a license (it is copyright Sony 2022). The intention of this landing page is to provide links/citations and to share the evaluation set promised in the paper.
Q: How is the surface normal computed from the ground-truth depth in NYU Depth v2? The paper only describes the approximation used for training, not the exact computation.
A: The surface normals can be estimated by optionally smoothing the NYU GT and applying the same normals-estimation operator to both the GT and the data source, while properly accounting for invalid pixels (present in parts of NYU, notably in the extended dataset) and removing invalid normals from the normals loss computation. This is good enough to supervise the training process.
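A minimal sketch of that masking step, assuming an L1 penalty between predicted and GT normals (the loss form and the validity test are my assumptions, not taken from the paper):
import torch

def masked_normals_loss(pred_normals, gt_normals, gt_depth):
    # pred_normals, gt_normals: (B, 3, H, W); gt_depth: (B, 1, H, W)
    # Treat non-positive depth as invalid (missing pixels in parts of NYU).
    valid = gt_depth > 0
    # Also drop normals that could not be computed (e.g. NaN after normalization).
    valid = valid & torch.isfinite(gt_normals).all(dim=1, keepdim=True)
    per_pixel_l1 = (pred_normals - gt_normals).abs().sum(dim=1, keepdim=True)
    return per_pixel_l1[valid].mean()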
Q: The dot pattern used to produce the sparse depth in NYU Depth v2 is unknown. Can you share an example so it can be reproduced?
A: To share the code generating the dot pattern we would have to sign a license agreement with you for strict non-commercial use of the code (e.g., if you are in a university or non-profit organization) as that code, too, is copyrighted. Feel free to reach out (to me) if your use would comply with this.
Q: The kernel_size of MaxPooling2D is missing.
A: The kernel size for the MaxPools is always 2, to reduce the resolution by a factor of 2 in each dimension. This is indicated in the paper.
Q: The kernel_size and number of filters of the Conv2d in the upsampling layer are missing.
A: As specified in the paper, the kernel size of Conv2D is always 3 (except the last MLP, which uses kernel size 1) and the stride is always 1 with 'same' padding. The upsampling factor is also always 2. The number of filters in the upsample-convolve stage (decoder) mirrors the number of filters in the encoder at the same scale, as per the graph in Figure 2b.
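Putting the two answers above together, here is a rough PyTorch sketch of one encoder and one decoder stage as I read them; the channel counts, activation, and upsampling mode are my assumptions, only the kernel sizes, stride, padding, and the x2 pooling/upsampling factors come from the answers:
import torch.nn as nn

def encoder_stage(in_ch, out_ch):
    # Conv2D: kernel 3, stride 1, 'same' padding; MaxPool(2) halves the resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

def decoder_stage(in_ch, out_ch):
    # Upsample by 2, then convolve; out_ch mirrors the encoder at the same scale.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
    )

# Final per-pixel MLP with kernel size 1 (the channel counts here are placeholders).
final_mlp = nn.Conv2d(32, 1, kernel_size=1)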
Hi @VC86, thank you for your fast and informative reply!
May I ask further about the Normals Estimation Block? I tried to build it like this:
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
arr = np.array(range(25))
tensor = torch.from_numpy(arr).to(torch.float).reshape(1, 1, 5, 5)
grad_x_layer = nn.Conv2d(1, 1, kernel_size=(1, 3), stride=1, padding=(0, 1), bias=False, padding_mode='replicate')
grad_y_layer = nn.Conv2d(1, 1, kernel_size=(3, 1), stride=1, padding=(1, 0), bias=False, padding_mode='replicate')
with torch.no_grad():
    grad_x_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 1, 3)))
    grad_y_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 3, 1)))
    grad_x = grad_x_layer(tensor)
    grad_y = grad_y_layer(tensor)
    minus_1 = -1 * torch.ones_like(tensor)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
print('input:\n', tensor)
print('grad_x:\n', grad_x)
print('grad_y:\n', grad_y)
print('normals:\n', normals)
and the output:
input:
tensor([[[[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.],
[10., 11., 12., 13., 14.],
[15., 16., 17., 18., 19.],
[20., 21., 22., 23., 24.]]]])
grad_x:
tensor([[[[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000]]]])
grad_y:
tensor([[[[2.5000, 2.5000, 2.5000, 2.5000, 2.5000],
[5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
[5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
[5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
[2.5000, 2.5000, 2.5000, 2.5000, 2.5000]]]])
normals:
tensor([[[[ 0.1826, 0.3482, 0.3482, 0.3482, 0.1826],
[ 0.0976, 0.1925, 0.1925, 0.1925, 0.0976],
[ 0.0976, 0.1925, 0.1925, 0.1925, 0.0976],
[ 0.0976, 0.1925, 0.1925, 0.1925, 0.0976],
[ 0.1826, 0.3482, 0.3482, 0.3482, 0.1826]],
[[ 0.9129, 0.8704, 0.8704, 0.8704, 0.9129],
[ 0.9759, 0.9623, 0.9623, 0.9623, 0.9759],
[ 0.9759, 0.9623, 0.9623, 0.9623, 0.9759],
[ 0.9759, 0.9623, 0.9623, 0.9623, 0.9759],
[ 0.9129, 0.8704, 0.8704, 0.8704, 0.9129]],
[[-0.3651, -0.3482, -0.3482, -0.3482, -0.3651],
[-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
[-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
[-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
[-0.3651, -0.3482, -0.3482, -0.3482, -0.3651]]]])
Is it correct?
I ask because, after applying this implementation directly to the GT depth in NYU Depth v2, the result looks strange compared to the visualization in the paper.
For example (data/nyudepthv2/val/official/00001.h5):
A minimal, reproducible snippet:
import h5py
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from PIL import Image
with h5py.File('data/nyudepthv2/val/official/00001.h5', 'r') as f:
    gt_depth = torch.from_numpy(np.array(f['depth'], dtype=np.float32)).unsqueeze(0).unsqueeze(0)  # (B, 1, H, W)
    rgb_img = Image.fromarray(np.transpose(f['rgb'], (1, 2, 0)))
grad_x_layer = nn.Conv2d(1, 1, kernel_size=(1, 3), stride=1, padding=(0, 1), bias=False, padding_mode='replicate')
grad_y_layer = nn.Conv2d(1, 1, kernel_size=(3, 1), stride=1, padding=(1, 0), bias=False, padding_mode='replicate')
with torch.no_grad():
    grad_x_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 1, 3)))
    grad_y_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 3, 1)))
    grad_x = grad_x_layer(gt_depth)
    grad_y = grad_y_layer(gt_depth)
    minus_1 = -1 * torch.ones_like(gt_depth)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
    normals = ((normals + 1) / 2 * 255).squeeze().to(torch.uint8)
    TF.to_pil_image(normals).save('normals.png')
rgb_img.save('rgb.png')
Thank you so much!
Your code looks correct and the test above is also numerically correct, but the normals indeed aren't what I would expect when visualized (although the way you convert them to UINT8 also looks correct). Some notes:
- We convert the depth to millimetres before computing the normals.
- You may want to build the normals as (-grad_x, -grad_y, torch.ones_like(gt_depth)) instead, depending on what convention you follow for the normals (as noted in the paper, the conventions must match between GT and estimate).
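Applied to your snippet, that alternative convention would read something like this (it simply flips the sign of the unnormalized normal):
normals = torch.cat((-grad_x, -grad_y, torch.ones_like(gt_depth)), dim=1)
normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)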
Thanks again for your reply!
Following your note that the depth is converted to millimetres before computing the normals, I modified the code as follows:
import h5py
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from PIL import Image
with h5py.File('data/nyudepthv2/val/official/00001.h5', 'r') as f:
    gt_depth = torch.from_numpy(np.array(f['depth'], dtype=np.float32)).unsqueeze(0).unsqueeze(0)  # (B, 1, H, W)
    rgb_img = Image.fromarray(np.transpose(f['rgb'], (1, 2, 0)))
# resize & center crop (480, 640) -> (240, 320) -> (224, 304)
gt_depth, rgb_img = TF.resize(gt_depth, (240, 320)), TF.resize(rgb_img, (240, 320))
gt_depth, rgb_img = TF.center_crop(gt_depth, (224, 304)), TF.center_crop(rgb_img, (224, 304))
# take the scale into account (meter to millimeter)
scaled_gt_depth = gt_depth * 1000.0
# compute normals
grad_x_weights = torch.tensor((-0.5, 0, 0.5), dtype=torch.float, requires_grad=False)
grad_x_weights = grad_x_weights.reshape((1, 1, 1, 3))
grad_y_weights = torch.tensor((-0.5, 0, 0.5), dtype=torch.float, requires_grad=False)
grad_y_weights = grad_y_weights.reshape((1, 1, 3, 1))
with torch.no_grad():
    x_padded_dense_depth = F.pad(scaled_gt_depth, (1, 1, 0, 0), 'replicate')
    y_padded_dense_depth = F.pad(scaled_gt_depth, (0, 0, 1, 1), 'replicate')
    grad_x = F.conv2d(x_padded_dense_depth, grad_x_weights)
    grad_y = F.conv2d(y_padded_dense_depth, grad_y_weights)
    minus_1 = -1 * torch.ones_like(scaled_gt_depth)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
# visualization
print(f'normals stats: min={torch.min(normals):.2f}, max={torch.max(normals):.2f}, median={torch.median(normals):.2f}')
TF.to_pil_image(((normals + 1) / 2 * 255).squeeze().to(torch.uint8)).save('normals.png')
rgb_img.save('rgb.png')
and the output:
normals stats: min=-1.00, max=1.00, median=-0.07
I think the visualization is far better than the previous one.
Is this scaling step correct? (I'm not familiar with surface normals.)
scaled_gt_depth = gt_depth * 1000.0
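For my own intuition (this is my reasoning, not something from the paper): since the z component is fixed at -1, the depth units determine how large the gradients are relative to it, which changes the normalized normal a lot. A quick check with made-up numbers:
import torch

grad_m = torch.tensor([0.005, 0.0])  # 5 mm of depth change per pixel, expressed in metres
grad_mm = grad_m * 1000.0            # the same change, expressed in millimetres

for g in (grad_m, grad_mm):
    n = torch.cat((g, torch.tensor([-1.0])))
    n = n / torch.linalg.norm(n)
    print(n)
# metres:      tensor([ 0.0050,  0.0000, -1.0000])  -> nearly flat everywhere
# millimetres: tensor([ 0.9806,  0.0000, -0.1961])  -> visible tilt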
@james77777778 Did you manage to replicate the results of this paper? Are the results still good after quantization?
First of all, thanks for the great work, but the source code is still missing. Could you share the training/evaluation code and pretrained weights for this work?
Also, I'm trying to reimplement it in PyTorch and I have some questions about the paper:
Thanks! Looking forward to your kind reply.