james77777778 opened this issue 2 years ago
Dear HongYu,
thank you for your interest!
Q: Could you share the training/evaluating code and pretrained weights about this work?
A: Thanks again for your interest; as this code originates from a commercial research entity, it contains non-shareable elements that would require a license (it is copyright Sony 2022). The intention of this landing page is to provide links/citations and to share the evaluation set promised in the paper.
Q: How is the surface normal computed from the ground-truth depth in NYU Depth v2? The paper only describes the approximation used for training, not the exact computation.
A: The surface normals can be estimated by optionally smoothing the NYU GT and applying the same normals-estimation operator to both the GT and the data source, while properly accounting for invalid pixels (present in parts of NYU, notably in the extended dataset) and removing invalid normals from the normals loss computation. This is good enough to supervise the training process.
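A minimal sketch of that masking step, assuming an L1 penalty between predicted and GT normals (the loss form and the validity test are my assumptions, not taken from the paper):
import torch

def masked_normals_loss(pred_normals, gt_normals, gt_depth):
    # pred_normals, gt_normals: (B, 3, H, W); gt_depth: (B, 1, H, W)
    # Treat non-positive depth as invalid (missing pixels in parts of NYU).
    valid = gt_depth > 0
    # Also drop normals that could not be computed (e.g. NaN after normalization).
    valid = valid & torch.isfinite(gt_normals).all(dim=1, keepdim=True)
    per_pixel_l1 = (pred_normals - gt_normals).abs().sum(dim=1, keepdim=True)
    return per_pixel_l1[valid].mean()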
Q: The dot pattern used to produce the sparse depth in NYU Depth v2 is unknown. Can you share an example so it can be reproduced?
A: To share the code generating the dot pattern we would have to sign a license agreement with you for strict non-commercial use of the code (e.g., if you are in a university or non-profit organization) as that code, too, is copyrighted. Feel free to reach out (to me) if your use would comply with this.
Q: The kernel_size of MaxPooling2D is missing.
A: The kernel size for the MaxPools is always 2, to reduce the resolution by a factor of 2 in each dimension. This is indicated in the paper.
Q: The kernel_size and number of filters of the Conv2d in the upsampling layer are missing.
A: As specified in the paper, the kernel size of Conv2D is always 3 (except the last MLP, which uses kernel size 1) and the stride is always 1 with 'same' padding. The upsampling factor is also always 2. The number of filters in the upsample-convolve stage (decoder) mirrors the number of filters in the encoder at the same scale, as per the graph in Figure 2b.
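Putting the two answers above together, here is a rough PyTorch sketch of one encoder and one decoder stage as I read them; the channel counts, activation, and upsampling mode are my assumptions, only the kernel sizes, stride, padding, and the x2 pooling/upsampling factors come from the answers:
import torch.nn as nn

def encoder_stage(in_ch, out_ch):
    # Conv2D: kernel 3, stride 1, 'same' padding; MaxPool(2) halves the resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

def decoder_stage(in_ch, out_ch):
    # Upsample by 2, then convolve; out_ch mirrors the encoder at the same scale.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
    )

# Final per-pixel MLP with kernel size 1 (the channel counts here are placeholders).
final_mlp = nn.Conv2d(32, 1, kernel_size=1)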
Hi @VC86, thank you for your fast and informative reply!
May I ask further about the Normals Estimation Block? I tried to build it like this:
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
arr = np.array(range(25))
tensor = torch.from_numpy(arr).to(torch.float).reshape(1, 1, 5, 5)
grad_x_layer = nn.Conv2d(1, 1, kernel_size=(1, 3), stride=1, padding=(0, 1), bias=False, padding_mode='replicate')
grad_y_layer = nn.Conv2d(1, 1, kernel_size=(3, 1), stride=1, padding=(1, 0), bias=False, padding_mode='replicate')
with torch.no_grad():
    grad_x_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 1, 3)))
    grad_y_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 3, 1)))
    grad_x = grad_x_layer(tensor)
    grad_y = grad_y_layer(tensor)
    minus_1 = -1 * torch.ones_like(tensor)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
print('input:\n', tensor)
print('grad_x:\n', grad_x)
print('grad_y:\n', grad_y)
print('normals:\n', normals)
and the output:
input:
tensor([[[[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.],
[10., 11., 12., 13., 14.],
[15., 16., 17., 18., 19.],
[20., 21., 22., 23., 24.]]]])
grad_x:
tensor([[[[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000],
[0.5000, 1.0000, 1.0000, 1.0000, 0.5000]]]])
grad_y:
tensor([[[[2.5000, 2.5000, 2.5000, 2.5000, 2.5000],
[5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
[5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
[5.0000, 5.0000, 5.0000, 5.0000, 5.0000],
[2.5000, 2.5000, 2.5000, 2.5000, 2.5000]]]])
normals:
tensor([[[[ 0.1826, 0.3482, 0.3482, 0.3482, 0.1826],
[ 0.0976, 0.1925, 0.1925, 0.1925, 0.0976],
[ 0.0976, 0.1925, 0.1925, 0.1925, 0.0976],
[ 0.0976, 0.1925, 0.1925, 0.1925, 0.0976],
[ 0.1826, 0.3482, 0.3482, 0.3482, 0.1826]],
[[ 0.9129, 0.8704, 0.8704, 0.8704, 0.9129],
[ 0.9759, 0.9623, 0.9623, 0.9623, 0.9759],
[ 0.9759, 0.9623, 0.9623, 0.9623, 0.9759],
[ 0.9759, 0.9623, 0.9623, 0.9623, 0.9759],
[ 0.9129, 0.8704, 0.8704, 0.8704, 0.9129]],
[[-0.3651, -0.3482, -0.3482, -0.3482, -0.3651],
[-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
[-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
[-0.1952, -0.1925, -0.1925, -0.1925, -0.1952],
[-0.3651, -0.3482, -0.3482, -0.3482, -0.3651]]]])
Is it correct?
I ask because, after applying this implementation directly to the GT depth in NYU Depth v2, the result looks strange compared to the visualization in the paper.
For example (data/nyudepthv2/val/official/00001.h5):
A minimal, reproducible snippet:
import h5py
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from PIL import Image
with h5py.File('data/nyudepthv2/val/official/00001.h5', 'r') as f:
    gt_depth = torch.from_numpy(np.array(f['depth'], dtype=np.float32)).unsqueeze(0).unsqueeze(0)  # (B, 1, H, W)
    rgb_img = Image.fromarray(np.transpose(f['rgb'], (1, 2, 0)))
grad_x_layer = nn.Conv2d(1, 1, kernel_size=(1, 3), stride=1, padding=(0, 1), bias=False, padding_mode='replicate')
grad_y_layer = nn.Conv2d(1, 1, kernel_size=(3, 1), stride=1, padding=(1, 0), bias=False, padding_mode='replicate')
with torch.no_grad():
    grad_x_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 1, 3)))
    grad_y_layer.weight = nn.Parameter(torch.tensor((-0.5, 0, 0.5)).reshape((1, 1, 3, 1)))
    grad_x = grad_x_layer(gt_depth)
    grad_y = grad_y_layer(gt_depth)
    minus_1 = -1 * torch.ones_like(gt_depth)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
    normals = ((normals + 1) / 2 * 255).squeeze().to(torch.uint8)
    TF.to_pil_image(normals).save('normals.png')
rgb_img.save('rgb.png')
Thank you so much!
Your code looks correct and the test above is also numerically correct, but the normals indeed aren't what I would expect when visualized (although the way you convert them to UINT8 also looks correct). Some notes:
- We convert the depth to millimetres before computing the normals.
- You may want to build the normals as (-grad_x, -grad_y, torch.ones_like(gt_depth)) instead, depending on what convention you follow for the normals (as noted in the paper, the conventions must match between GT and estimate).
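Applied to your snippet, that alternative convention would read something like this (it simply flips the sign of the unnormalized normal):
normals = torch.cat((-grad_x, -grad_y, torch.ones_like(gt_depth)), dim=1)
normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)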
Thanks again for your reply!
Following your note that the depth is converted to millimetres before computing the normals, I modified the code as follows:
import h5py
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from PIL import Image
with h5py.File('data/nyudepthv2/val/official/00001.h5', 'r') as f:
    gt_depth = torch.from_numpy(np.array(f['depth'], dtype=np.float32)).unsqueeze(0).unsqueeze(0)  # (B, 1, H, W)
    rgb_img = Image.fromarray(np.transpose(f['rgb'], (1, 2, 0)))
# resize & center crop (480, 640) -> (240, 320) -> (224, 304)
gt_depth, rgb_img = TF.resize(gt_depth, (240, 320)), TF.resize(rgb_img, (240, 320))
gt_depth, rgb_img = TF.center_crop(gt_depth, (224, 304)), TF.center_crop(rgb_img, (224, 304))
# take the scale into account (meter to millimeter)
scaled_gt_depth = gt_depth * 1000.0
# compute normals
grad_x_weights = torch.tensor((-0.5, 0, 0.5), dtype=torch.float, requires_grad=False)
grad_x_weights = grad_x_weights.reshape((1, 1, 1, 3))
grad_y_weights = torch.tensor((-0.5, 0, 0.5), dtype=torch.float, requires_grad=False)
grad_y_weights = grad_y_weights.reshape((1, 1, 3, 1))
with torch.no_grad():
    x_padded_dense_depth = F.pad(scaled_gt_depth, (1, 1, 0, 0), 'replicate')
    y_padded_dense_depth = F.pad(scaled_gt_depth, (0, 0, 1, 1), 'replicate')
    grad_x = F.conv2d(x_padded_dense_depth, grad_x_weights)
    grad_y = F.conv2d(y_padded_dense_depth, grad_y_weights)
    minus_1 = -1 * torch.ones_like(scaled_gt_depth)
    normals = torch.cat((grad_x, grad_y, minus_1), dim=1)
    normals = normals / torch.linalg.norm(normals, dim=1, ord=2).unsqueeze(1)
# visualization
print(f'normals stats: min={torch.min(normals):.2f}, max={torch.max(normals):.2f}, median={torch.median(normals):.2f}')
TF.to_pil_image(((normals + 1) / 2 * 255).squeeze().to(torch.uint8)).save('normals.png')
rgb_img.save('rgb.png')
and the output:
normals stats: min=-1.00, max=1.00, median=-0.07
I think the visualization is far better than the previous one.
Is this scaling step correct? (I'm not familiar with surface normals.)
scaled_gt_depth = gt_depth * 1000.0
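For my own intuition (this is my reasoning, not something from the paper): since the z component is fixed at -1, the depth units determine how large the gradients are relative to it, which changes the normalized normal a lot. A quick check with made-up numbers:
import torch

grad_m = torch.tensor([0.005, 0.0])  # 5 mm of depth change per pixel, expressed in metres
grad_mm = grad_m * 1000.0            # the same change, expressed in millimetres

for g in (grad_m, grad_mm):
    n = torch.cat((g, torch.tensor([-1.0])))
    n = n / torch.linalg.norm(n)
    print(n)
# metres:      tensor([ 0.0050,  0.0000, -1.0000])  -> nearly flat everywhere
# millimetres: tensor([ 0.9806,  0.0000, -0.1961])  -> visible tilt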
@james77777778 Did you manage to replicate the results of this paper? Are the results still good after quantization?
First of all, thanks for the great work, but the source code is still missing. Could you share the training/evaluation code and pretrained weights for this work?
Also, I'm trying to reimplement it in PyTorch and I have some questions about the paper:
Thanks! Looking forward to your kind reply.