sony / ai-research-code

Apache License 2.0

[Quantized Depth Completion] Questions about implementation details #61

Open james77777778 opened 2 years ago

james77777778 commented 2 years ago

First of all, thanks for the great work. However, the source code is still missing. Could you share the training/evaluation code and the pretrained weights for this work?

Also, I'm trying to reimplement it in PyTorch, and I have some questions about the paper:

  1. How do you compute the surface normals from the ground-truth depth in NYU Depth v2? The paper only describes the approximation used for training, not the exact computation.
  2. The dot pattern used to produce the sparse depth in NYU Depth v2 is not described. Could you share an example so I can reproduce it?
  3. The kernel_size of MaxPooling2D is missing.
  4. The kernel_size and filter_size of the Conv2d in the Upsampling layer are missing.
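For reference, the training-time approximation discussed in this thread (and reimplemented in the PyTorch code further down) can be written as follows. This is an editorial restatement, not notation copied from the paper: the depth map $D$ is treated as a height field over the pixel grid, $x$ indexes columns, $y$ indexes rows, and border pixels are handled by replicate padding.

$$
\mathbf{n}(x, y) = \frac{\big(\partial_x D(x, y),\; \partial_y D(x, y),\; -1\big)}{\sqrt{\partial_x D(x, y)^2 + \partial_y D(x, y)^2 + 1}},
\qquad
\partial_x D(x, y) \approx \frac{D(x+1, y) - D(x-1, y)}{2},
\qquad
\partial_y D(x, y) \approx \frac{D(x, y+1) - D(x, y-1)}{2}.
$$

Because the derivatives are measured in depth units per pixel while the third component is a fixed $-1$, the resulting direction depends on the unit in which $D$ is expressed.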

Thanks! I look forward to your reply.

VC86 commented 2 years ago

Dear HongYu,

Thank you for your interest!

james77777778 commented 2 years ago

Hi @VC86, thank you for your fast and informative reply!

May I ask a follow-up question about the Normals Estimation Block? I tried to build it like this:

import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

arr = np.array(range(25))
tensor = torch.from_numpy(arr).to(torch.float).reshape(1, 1, 5, 5)
grad_x_layer = nn.Conv2d(1, 1, kernel_size=(1, 3), stride=1, padding=(0, 1), bias=False, padding_mode='replicate')
grad_y_layer = nn.Conv2d(1, 1, kernel_size=(3, 1), stride=1, padding=(1, 0), bias=False, padding_mode='replicate')
with torch.no_grad():
    grad_x_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 1, 3)))
    grad_y_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 3, 1)))
    grad_x = grad_x_layer(tensor)
    grad_y = grad_y_layer(tensor)
    minus_1 = -1 * torch.ones_like(tensor)
    # stack (dD/dx, dD/dy, -1) along the channel axis -> (B, 3, H, W)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
print('input:\n', tensor)
print('grad_x:\n', grad_x)
print('grad_y:\n', grad_y)
print('normals:\n', normals)

and the output:

 tensor([[[[ 0.,  1.,  2.,  3.,  4.],
          [ 5.,  6.,  7.,  8.,  9.],
          [10., 11., 12., 13., 14.],
          [15., 16., 17., 18., 19.],
          [20., 21., 22., 23., 24.]]]])
 tensor([[[[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
          [0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
          [0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
          [0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
          [0.5000, 1.0000, 1.0000, 1.0000, 0.5000]]]])
 tensor([[[[2.5000, 2.5000, 2.5000, 2.5000, 2.5000],
          [5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
          [5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
          [5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
          [2.5000, 2.5000, 2.5000, 2.5000, 2.5000]]]])
 tensor([[[[ 0.1826,  0.3482,  0.3482,  0.3482,  0.1826],
          [ 0.0976,  0.1925,  0.1925,  0.1925,  0.0976],
          [ 0.0976,  0.1925,  0.1925,  0.1925,  0.0976],
          [ 0.0976,  0.1925,  0.1925,  0.1925,  0.0976],
          [ 0.1826,  0.3482,  0.3482,  0.3482,  0.1826]],

         [[ 0.9129,  0.8704,  0.8704,  0.8704,  0.9129],
          [ 0.9759,  0.9623,  0.9623,  0.9623,  0.9759],
          [ 0.9759,  0.9623,  0.9623,  0.9623,  0.9759],
          [ 0.9759,  0.9623,  0.9623,  0.9623,  0.9759],
          [ 0.9129,  0.8704,  0.8704,  0.8704,  0.9129]],

         [[-0.3651, -0.3482, -0.3482, -0.3482, -0.3651],
          [-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
          [-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
          [-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
          [-0.3651, -0.3482, -0.3482, -0.3482, -0.3651]]]])
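As an independent cross-check of the toy example (an editorial sketch, not code from the paper or the repository), the same central-difference scheme can be written in plain NumPy. `central_diff_normals` is a hypothetical helper name, and `np.pad(..., mode="edge")` plays the role of `padding_mode='replicate'`:

```python
import numpy as np


def central_diff_normals(depth: np.ndarray) -> np.ndarray:
    """Approximate normals (3, H, W) from a depth map (H, W).

    Mirrors the conv-based PyTorch code above: kernel (-0.5, 0, 0.5),
    replicate ("edge") padding at the borders, and a constant -1 as the
    third component before normalizing each vector to unit length.
    """
    d = depth.astype(np.float64)
    px = np.pad(d, ((0, 0), (1, 1)), mode="edge")  # pad left/right only
    py = np.pad(d, ((1, 1), (0, 0)), mode="edge")  # pad top/bottom only
    grad_x = 0.5 * (px[:, 2:] - px[:, :-2])        # (D[x+1] - D[x-1]) / 2
    grad_y = 0.5 * (py[2:, :] - py[:-2, :])        # (D[y+1] - D[y-1]) / 2
    normals = np.stack([grad_x, grad_y, -np.ones_like(d)])
    return normals / np.linalg.norm(normals, axis=0, keepdims=True)


toy = np.arange(25, dtype=np.float64).reshape(5, 5)
n = central_diff_normals(toy)
print(np.round(n[:, 0, 0], 4))  # top-left corner: 0.1826, 0.9129, -0.3651
print(np.round(n[:, 2, 2], 4))  # interior pixel: 0.1925, 0.9623, -0.1925
```

Note that `np.gradient` would not reproduce the border values: at the edges it uses one-sided differences (for example 1.0 instead of 0.5 for `grad_x` in the first column of the toy input).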

Is it correct?

I'm asking because when I apply this implementation directly to the GT depth in NYU Depth v2, the result looks strange compared with the visualization in the paper. For example (data/nyudepthv2/val/official/00001.h5):

Here is a minimal reproducible snippet:

import h5py
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from PIL import Image

with h5py.File('data/nyudepthv2/val/official/00001.h5', 'r') as f:
    gt_depth = torch.from_numpy(np.array(f['depth'], dtype=np.float32)).unsqueeze(0).unsqueeze(0)  # (B, 1, H, W)
    rgb_img = Image.fromarray(np.transpose(f['rgb'], (1, 2, 0)))

grad_x_layer = nn.Conv2d(1, 1, kernel_size=(1, 3), stride=1, padding=(0, 1), bias=False, padding_mode='replicate')
grad_y_layer = nn.Conv2d(1, 1, kernel_size=(3, 1), stride=1, padding=(1, 0), bias=False, padding_mode='replicate')

with torch.no_grad():
    grad_x_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 1, 3)))
    grad_y_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 3, 1)))
    grad_x = grad_x_layer(gt_depth)
    grad_y = grad_y_layer(gt_depth)
    minus_1 = -1 * torch.ones_like(gt_depth)
    # stack (dD/dx, dD/dy, -1) along the channel axis -> (B, 3, H, W)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
    normals = ((normals + 1) / 2 * 255).squeeze().to(torch.uint8)

TF.to_pil_image(normals).save('normals.png')
rgb_img.save('rgb.png')
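The last step maps each normal component from [-1, 1] to [0, 255] before saving the PNG. A minimal NumPy sketch of that mapping and an approximate inverse (the helper names and the added clipping are assumptions, not part of the original snippet) shows that the round trip loses less than 2/255 per component, because the cast to uint8 truncates the fractional part:

```python
import numpy as np


def encode_normals(normals: np.ndarray) -> np.ndarray:
    """Map unit-normal components from [-1, 1] to uint8 [0, 255].

    Same mapping as ((normals + 1) / 2 * 255).to(torch.uint8) in the
    snippet above; the cast truncates the fractional part. Clipping guards
    against values marginally outside [-1, 1] after normalization.
    """
    scaled = (np.clip(normals, -1.0, 1.0) + 1.0) / 2.0 * 255.0
    return scaled.astype(np.uint8)


def decode_normals(encoded: np.ndarray) -> np.ndarray:
    """Approximate inverse of encode_normals: uint8 [0, 255] -> [-1, 1]."""
    return encoded.astype(np.float64) / 255.0 * 2.0 - 1.0


rng = np.random.default_rng(0)
v = rng.normal(size=(3, 4, 4))
v /= np.linalg.norm(v, axis=0, keepdims=True)  # random unit normals, (3, H, W)
err = np.abs(decode_normals(encode_normals(v)) - v).max()
print(f"max round-trip error: {err:.4f} (bound: 2/255 = {2 / 255:.4f})")
```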

Thank you so much!

VC86 commented 2 years ago

Your code looks correct, and the test above is numerically correct as well, but the normals indeed don't look as I would expect when visualized (although the way you convert them to UINT8 also looks correct). Some notes:

james77777778 commented 2 years ago

Thanks again for your reply!

Since you mentioned that you convert the depth to millimeters before computing the normals, I modified the code as follows:

import h5py
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from PIL import Image

with h5py.File('data/nyudepthv2/val/official/00001.h5', 'r') as f:
    gt_depth = torch.from_numpy(np.array(f['depth'], dtype=np.float32)).unsqueeze(0).unsqueeze(0)  # (B, 1, H, W)
    rgb_img = Image.fromarray(np.transpose(f['rgb'], (1, 2, 0)))

# resize & center crop (480, 640) -> (240, 320) -> (224, 304)
gt_depth, rgb_img = TF.resize(gt_depth, (240, 320)), TF.resize(rgb_img, (240, 320))
gt_depth, rgb_img = TF.center_crop(gt_depth, (224, 304)), TF.center_crop(rgb_img, (224, 304))

# take the scale into account (meter to millimeter)
scaled_gt_depth = gt_depth * 1000.0

# compute normals
grad_x_weights = torch.tensor((-0.5, 0, 0.5), dtype=torch.float, requires_grad=False)
grad_x_weights = grad_x_weights.reshape((1, 1, 1, 3))
grad_y_weights = torch.tensor((-0.5, 0, 0.5), dtype=torch.float, requires_grad=False)
grad_y_weights = grad_y_weights.reshape((1, 1, 3, 1))
with torch.no_grad():
    x_padded_dense_depth = F.pad(scaled_gt_depth, (1, 1, 0, 0), 'replicate')
    y_padded_dense_depth = F.pad(scaled_gt_depth, (0, 0, 1, 1), 'replicate')
    grad_x = F.conv2d(x_padded_dense_depth, grad_x_weights)
    grad_y = F.conv2d(y_padded_dense_depth, grad_y_weights)
    minus_1 = -1 * torch.ones_like(scaled_gt_depth)
    # stack (dD/dx, dD/dy, -1) along the channel axis -> (B, 3, H, W)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)

# visualization
print(f'normals stats: min={torch.min(normals):.2f}, max={torch.max(normals):.2f}, median={torch.median(normals):.2f}')
TF.to_pil_image(((normals + 1) / 2 * 255).squeeze().to(torch.uint8)).save('normals.png')
rgb_img.save('rgb.png')

and the output:

normals stats: min=-1.00, max=1.00, median=-0.07


I think the visualization is much better than the previous one. Is this scaling step correct? (I'm not familiar with surface normals 😔)

scaled_gt_depth = gt_depth * 1000.0
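On the scaling question: in this approximation the x/y derivatives are taken per pixel while the third component is a fixed -1, so multiplying the depth by a constant changes the direction of the normal, not just its length. A minimal sketch with a hypothetical plane whose depth grows by 1 cm per pixel (the helper name and the slope are illustrative assumptions) makes the effect concrete:

```python
import numpy as np


def normal_of_slope(slope_per_pixel: float) -> np.ndarray:
    """Unit normal from n ~ (dD/dx, dD/dy, -1) for a plane whose depth
    rises by `slope_per_pixel` depth units per pixel along x (dD/dy = 0)."""
    n = np.array([slope_per_pixel, 0.0, -1.0])
    return n / np.linalg.norm(n)


# Hypothetical plane: depth increases by 1 cm from one pixel to the next.
in_meters = normal_of_slope(0.01)       # depth expressed in meters
in_millimeters = normal_of_slope(10.0)  # same plane, depth in millimeters

print(np.round(in_meters, 4))
print(np.round(in_millimeters, 4))
```

With depth in meters the normal stays within about 0.01 of (0, 0, -1), which renders as an almost uniform color, whereas in millimeters the same plane tilts to roughly (0.995, 0, -0.0995). The factor of 1000 therefore effectively fixes how many depth units correspond to one pixel step. It is a convention (here, the millimeter scale mentioned above) rather than a physically exact normal, which would additionally require the camera intrinsics.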


zherlock030 commented 1 year ago

@james77777778 Did you manage to replicate the results of this paper? Does it still perform well after quantization?