The document of torchvision.ops.deform_conv2d is not clear

Zhaoyi-Yan commented 3 years ago

📚 Documentation

From the documentation, I cannot get the exact meaning of 18(ie, 233) channels of the offset in a deformable convolution?

I want to visualize the offset of the deformable convolution with kernel size 3*3. So It’s essential for me to know what’s the exact meaning of these channels.

I write down something possible here:

upper-left: ul
upper-right: ur
bottom-left: bl
bottom-right: br
up: u
bottom: b
right: r
left: l
center: c

possible offset layout (maybe not correct):
delta_ul_x, delta_ul_y,   delta_u_x, delta_u_y,     delta_ur_x, delta_ur_y;
delta_l_x, delta_l_y,       delta_c_x, delta_c_y,      delta_r_x, delta_r_y;
delta_bl_x, delta_bl_y,   delta_b_x, delta_b_y,     delta_br_x, delta_br_y;

voldemortX commented 3 years ago

@Zhaoyi-Yan Perhaps the CPU source code can lend you some insights. I'd say your guess seems reasonable, but I don't remember much detail of DCN so I wouldn't be sure.

Zhaoyi-Yan commented 3 years ago

After reading the source code you refer to, it seems reasonable. However, it would be better to get a detailed note on the offset for the users.

voldemortX commented 3 years ago

Maybe some comments should be added for this. What do you think? @NicolasHug

NicolasHug commented 3 years ago

sure, any PR to improve the docs would be very welcome!

voldemortX commented 3 years ago

I'm not an expert on DCN right now... So maybe you'd like to send a PR for this? @Zhaoyi-Yan

Zhaoyi-Yan commented 3 years ago

I am not too...

voldemortX commented 3 years ago

@NicolasHug @Zhaoyi-Yan I'll try send a PR to clarify the doc after some paper re-reading, if no one more familiar with DCN turns up.

voldemortX commented 3 years ago

@Zhaoyi-Yan I've sent a PR for this. I now believe your initial guess is correct, if you consider the height direction as x, width direction as y.

dariofuoli commented 3 years ago

It would be very important to also know the order of the elements in (from docs):

(Tensor[batch_size, *2 offset_groups kernel_height kernel_width**, (offset) – out_height, out_width]): offsets to be applied for each position in the convolution kernel.

I.e., what is the arrangement of 2 offset_groups kernel_height * kernel_width, is it in this particular order? Considering the comments here, the following would be more likely: *offset_groups kernel_height kernel_width 2** with the kernel dimensions from left to right, top to bottom.

I think it could be made a lot clearer with passing a tensor instead of a flattened array: (offset_group x kernel_height x kernel_width x 2)

voldemortX commented 3 years ago

It would be very important to also know the order of the elements in (from docs):

(Tensor[batch_size, *2 offset_groups kernel_height kernel_width**, (offset) – out_height, out_width]): offsets to be applied for each position in the convolution kernel.

I.e., what is the arrangement of 2 offset_groups kernel_height * kernel_width, is it in this particular order? Considering the comments here, the following would be more likely: *offset_groups kernel_height kernel_width 2** with the kernel dimensions from left to right, top to bottom.

It is very confusing indeed. You could checkout this ongoing PR for some clarification (I tried but, the explanation there is still not very clear...)

I think it could be made a lot clearer with passing a tensor instead of a flattened array: (offset_group x kernel_height x kernel_width x 2)

I think that could introduce a BC-Break of some sort? Personally, I think maybe if deformable conv could be implemented as a pytorch layer, things would be much easier...

dariofuoli commented 3 years ago

I think that could introduce a BC-Break of some sort? Personally, I think maybe if deformable conv could be implemented as a pytorch layer, things would be much easier...

I am not a developer, but I think this might be handled with a fixed internal flatten operation, which can handle both inputs?

Personally, I think stating the exact order of elements encoded in the dimension "2 offset_groups kernel_height * kernel_width" in the docs would be sufficient, I like the functional approach of the current version.

Assuming the order: T in groups x kernel_height x kernel_width x [offset_h, offset_w] then stating that the "flattened tensor" to pass to the function will be: [T[0,0,0,0], T[0,0,0,1], T[0,0,1,0], T[0,0,1,1],...]

If this assumption is correct, for clarity, the docs should state: (Tensor[batch_size, *offset_groups kernel_height kernel_width 2**, (offset) – out_height, out_width]): offsets to be applied for each position in the convolution kernel.

lartpang commented 2 years ago

Maybe this demo will help us understand the role of offset.

import torch
from torchvision.ops import deform_conv2d

h = w = 3

# batch_size, num_channels, out_height, out_width
x = torch.arange(h * w * 3, dtype=torch.float32).reshape(1, 3, h, w)

# to show the effect of offset more intuitively, only the case of kh=kw=1 is considered here
offset = torch.FloatTensor(
    [  # create our predefined offset with offset_groups = 3
        0,
        -1,  # sample the left pixel of the centroid pixel
        0,
        1,  # sample the right pixel of the centroid pixel
        -1,
        0,  # sample the top pixel of the centroid pixel
    ]  # here, we divide the input channels into offset_groups groups with different offsets.
).reshape(1, 2 * 3 * 1 * 1, 1, 1)
# here we use the same offset for each local neighborhood in the single channel
# so we repeat the offset to the whole space: batch_size, 2 * offset_groups * kh * kw, out_height, out_width
offset = offset.repeat(1, 1, h, w)

weight = torch.FloatTensor(
    [
        [1, 0, 0],  # only extract the first channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
        [1, 1, 0],  # add the first and the second channels of the input tensor
        [0, 0, 1],  # only extract the third channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
    ]
).reshape(5, 3, 1, 1)
deconv_shift = deform_conv2d(x, offset=offset, weight=weight)
print(deconv_shift)

"""
tensor([[[[ 0.,  0.,  1.],  # offset=(0, -1) the first channel of the input tensor
          [ 0.,  3.,  4.],  # output hw indices (1, 2) => (1, 2-1) => input indices (1, 1)
          [ 0.,  6.,  7.]], # output hw indices (2, 1) => (2, 1-1) => input indices (2, 0)

         [[10., 11.,  0.],  # offset=(0, 1) the second channel of the input tensor
          [13., 14.,  0.],  # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
          [16., 17.,  0.]], # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)

         [[10., 11.,  1.],  # offset=[(0, -1), (0, 1)], accumulate the first and second channels after being sampled with an offset.
          [13., 17.,  4.],
          [16., 23.,  7.]],

         [[ 0.,  0.,  0.],  # offset=(-1, 0) the third channel of the input tensor
          [18., 19., 20.],  # output hw indices (1, 1) => (1-1, 1) => input indices (0, 1)
          [21., 22., 23.]], # output hw indices (2, 2) => (2-1, 2) => input indices (1, 2)

         [[10., 11.,  0.],  # offset=(0, 1) the second channel of the input tensor
          [13., 14.,  0.],  # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
          [16., 17.,  0.]]]])  # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)
"""

pytorch / vision

The document of torchvision.ops.deform_conv2d is not clear #3673

📚 Documentation