sail-sg / volo

VOLO: Vision Outlooker for Visual Recognition
Apache License 2.0

Equation 5 in the paper and the `fold` operation in the code do not seem to be consistent. #7

Open lartpang opened 3 years ago

lartpang commented 3 years ago

Equation 5:

(image: Equation 5 from the paper)

In my opinion, this equation computes the sum of the features in the neighborhood corresponding to (i, j).

But in the code:

https://github.com/sail-sg/volo/blob/1f67923404d85cb8012a61b35d7eff782fe90cef/models/volo.py#L94-L95

F.fold(x, output_size=(H, W), ...) implements another operation.

Let's start with a simple example.

a = torch.randperm(18).float().reshape(1, 2, 3, 3)  # B=1, C=2, H=W=3

a
Out[33]: 
tensor([[[[ 7., 14.,  0.],  # the first channel
          [ 8., 15., 16.],
          [ 6.,  4., 12.]],
         [[ 2., 17., 13.],  # the second channel
          [10.,  9., 11.],
          [ 5.,  3.,  1.]]]])

Collect the adjacent features in each local 3x3 region and stack the nine values along the channel dimension.

unfold_a = F.unfold(a, kernel_size=3, padding=1)  # 1 x C*K*K x H*W

Each row indexes one of the K*K neighbor positions within a window, per channel; there are C*K*K = 18 rows.

Each column indexes a window location in the H*W space; there are H*W = 9 columns.

For the rows, note that they are channel-separated: the K*K in-window offsets of the first channel are stacked first, followed by those of the second channel.

For example, the fourth row [ 0., 7., 14., 0., 8., 15., 0., 6., 4.] collects the fourth neighbor of every window from the first channel plane.

Specifically, its first element 0 is the fourth value of the flattened K*K window ([[0, 0, 0], [0, 7, 14], [0, 8, 15]]) located at the upper-left corner of the padded a.

unfold_a
Out[35]: 
tensor([[
         # the first channel
         [ 0.,  0.,  0.,  0.,  7., 14.,  0.,  8., 15.],  
         [ 0.,  0.,  0.,  7., 14.,  0.,  8., 15., 16.],         
         [ 0.,  0.,  0., 14.,  0.,  0., 15., 16.,  0.],
         [ 0.,  7., 14.,  0.,  8., 15.,  0.,  6.,  4.],
         [ 7., 14.,  0.,  8., 15., 16.,  6.,  4., 12.],
         [14.,  0.,  0., 15., 16.,  0.,  4., 12.,  0.],
         [ 0.,  8., 15.,  0.,  6.,  4.,  0.,  0.,  0.],
         [ 8., 15., 16.,  6.,  4., 12.,  0.,  0.,  0.],
         [15., 16.,  0.,  4., 12.,  0.,  0.,  0.,  0.],
         # the second channel
         [ 0.,  0.,  0.,  0.,  2., 17.,  0., 10.,  9.],
         [ 0.,  0.,  0.,  2., 17., 13., 10.,  9., 11.],
         [ 0.,  0.,  0., 17., 13.,  0.,  9., 11.,  0.],
         [ 0.,  2., 17.,  0., 10.,  9.,  0.,  5.,  3.],
         [ 2., 17., 13., 10.,  9., 11.,  5.,  3.,  1.],
         [17., 13.,  0.,  9., 11.,  0.,  3.,  1.,  0.],
         [ 0., 10.,  9.,  0.,  5.,  3.,  0.,  0.,  0.],
         [10.,  9., 11.,  5.,  3.,  1.,  0.,  0.,  0.],
         [ 9., 11.,  0.,  3.,  1.,  0.,  0.,  0.,  0.]]])
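This row/column layout can be checked programmatically. The following is a small sanity script I added (not from the original thread); it relies only on the fact that offset 4 is the center of a flattened 3x3 window:

```python
import torch
import torch.nn.functional as F

a = torch.randperm(18).float().reshape(1, 2, 3, 3)  # B=1, C=2, H=W=3
u = F.unfold(a, kernel_size=3, padding=1)

# Shape: rows = C*K*K = 18 (channel-major, then in-window offset),
# columns = H*W = 9 (one per window center).
print(u.shape)  # torch.Size([1, 18, 9])

# Offset 4 is the center of each 3x3 window, so row 4 (channel 0) and
# row 9 + 4 (channel 1) reproduce the original feature maps exactly:
assert torch.equal(u[0, 4].reshape(3, 3), a[0, 0])
assert torch.equal(u[0, 13].reshape(3, 3), a[0, 1])
```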

The original data again, for comparison:

a
Out[38]: 
tensor([[[[ 7., 14.,  0.],
          [ 8., 15., 16.],
          [ 6.,  4., 12.]],
         [[ 2., 17., 13.],
          [10.,  9., 11.],
          [ 5.,  3.,  1.]]]])

Now apply fold; the result is inconsistent with my reading of the equation in the paper.

fold_a = F.fold(unfold_a, output_size=(3, 3), kernel_size=3, padding=1)

fold_a
Out[40]: 
tensor([[[[ 28.,  84.,   0.],
          [ 48., 135.,  96.],
          [ 24.,  24.,  48.]],
         [[  8., 102.,  52.],
          [ 60.,  81.,  66.],
          [ 20.,  18.,   4.]]]])

For simplicity, we only look at the first channel plane.

[[ 28.,  84.,   0.],
 [ 48., 135.,  96.],
 [ 24.,  24.,  48.]]

Comparing with unfold_a above, the value of each element here is the accumulation of all the window values that land on the same position once each window is placed back at its original location in the padded a:

  1. In the first step, the values of different windows are placed back to their original corresponding positions in sequence.
  2. In the second step, the values corresponding to the same element positions are accumulated.
    1. For example, fold_a[0, 0, 0, 1] = 84 = unfold_a[0, 5, 0] + unfold_a[0, 4, 1] + unfold_a[0, 3, 2] + unfold_a[0, 2, 3] + unfold_a[0, 1, 4] + unfold_a[0, 0, 5]. These values come from different windows, not from the "same window" expressed in Equation 5 of the paper.
  3. The third step is to remove padding.
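Steps 1 and 2 can be verified numerically. The six unfold entries listed above sum to the fold output at position (0, 1) for any input, not just this particular randperm draw (a quick check I added):

```python
import torch
import torch.nn.functional as F

a = torch.randperm(18).float().reshape(1, 2, 3, 3)
u = F.unfold(a, kernel_size=3, padding=1)            # 1 x 18 x 9
f = F.fold(u, output_size=(3, 3), kernel_size=3, padding=1)

# Position (0, 1) of channel 0 lies inside the six windows centered at
# (0,0), (0,1), (0,2), (1,0), (1,1), (1,2); in each it occupies a
# different in-window offset, hence a different unfold row.
manual = (u[0, 5, 0] + u[0, 4, 1] + u[0, 3, 2] +
          u[0, 2, 3] + u[0, 1, 4] + u[0, 0, 5])
assert torch.isclose(f[0, 0, 0, 1], manual)
```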
wuyongfa-genius commented 3 years ago

@lartpang Agree with you completely. I also refer to the PyTorch documentation:

Fold calculates each combined value in the resulting large tensor by summing all values from all containing blocks. Unfold extracts the values in the local blocks by copying from the large tensor. So, if the blocks overlap, they are not inverses of each other. (https://pytorch.org/docs/stable/generated/torch.nn.Fold.html?highlight=fold#torch.nn.Fold)
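The same docs page suggests the standard normalization: fold∘unfold of an all-ones tensor counts how many overlapping windows cover each position, and dividing by that count recovers the input. A quick illustration on the running example:

```python
import torch
import torch.nn.functional as F

a = torch.randperm(18).float().reshape(1, 2, 3, 3)
u = F.unfold(a, kernel_size=3, padding=1)
f = F.fold(u, output_size=(3, 3), kernel_size=3, padding=1)

# fold(unfold(x)) multiplies each element by its window-coverage count.
ones = torch.ones_like(a)
divisor = F.fold(F.unfold(ones, kernel_size=3, padding=1),
                 output_size=(3, 3), kernel_size=3, padding=1)
print(divisor[0, 0])  # corners 4, edges 6, center 9
assert torch.allclose(f / divisor, a)
```

This explains the values shown earlier, e.g. fold_a[0, 0, 1, 1] = 135 = 15 * 9, since the center pixel is covered by all nine windows.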

zihangJiang commented 3 years ago

Note the definition of the local window in equation 3.

(image: the definition from equation 3)

It's a local window centered at (i,j). And here in equation 5

(image: Equation 5) The summation is over (m, n), which means gathering vectors from different windows but the same original element location (i, j), the same as steps 1 and 2 in your comment @lartpang.

The notation is a little tricky: it does not calculate the sum of features in the neighborhood of (i, j), but performs a fold op.
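To make the summation over window centers (m, n) concrete, here is a loop that implements my reading of equation 5 (stride 1, padding K//2; the variable names are mine, not the paper's) and matches F.fold exactly:

```python
import torch
import torch.nn.functional as F

K, H, W = 3, 3, 3
a = torch.randperm(18).float().reshape(1, 2, H, W)
u = F.unfold(a, kernel_size=K, padding=K // 2)       # 1 x C*K*K x H*W
f = F.fold(u, output_size=(H, W), kernel_size=K, padding=K // 2)

# Manual fold: for each output position (i, j), sum over the window
# centers (m, n) whose K x K window contains (i, j).
manual = torch.zeros_like(a)
for i in range(H):
    for j in range(W):
        for m in range(max(0, i - K // 2), min(H, i + K // 2 + 1)):
            for n in range(max(0, j - K // 2), min(W, j + K // 2 + 1)):
                # inside window (m, n), position (i, j) sits at this offset:
                off = (i - m + K // 2) * K + (j - n + K // 2)
                col = m * W + n                       # window centers are row-major
                manual[0, :, i, j] += u[0, [off, K * K + off], col]
assert torch.allclose(manual, f)
```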

lartpang commented 3 years ago

@zihangJiang

Ah, it looks like this is indeed the case; my understanding was wrong. In my opinion, the paper should emphasize this more. After all, this cross-window aggregation makes the Outlooker attention proposed in this paper significantly different from other attention strategies.

zihangJiang commented 3 years ago

@lartpang Agree, some additional emphasis and explanation of the notation could be added for better understanding.

monney commented 3 years ago

Interesting, I also thought we were summing over just the kernel, as in a convolution. This answers my question from the other thread and explains why we need to aggregate over all pixels, not just the center. The fold makes the effective neighborhood size 2K-1 for each pixel instead of just K. Very efficient.
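The 2K-1 effective neighborhood can be illustrated with a toy stand-in for the attention: replace the per-window weights with a uniform average (my simplification, not the actual Outlooker weights), push a unit impulse through unfold, mix, fold, and measure the nonzero extent:

```python
import torch
import torch.nn.functional as F

K, H = 3, 7
x = torch.zeros(1, 1, H, H)
x[0, 0, 3, 3] = 1.0                                    # unit impulse at the center

u = F.unfold(x, kernel_size=K, padding=K // 2)         # 1 x K*K x H*H
u = u.mean(dim=1, keepdim=True).expand(-1, K * K, -1)  # uniform mixing inside each window
y = F.fold(u, output_size=(H, H), kernel_size=K, padding=K // 2)

# The impulse reaches every window centered within K//2 of it, and each of
# those windows writes back over its full K x K footprint: span = 2K - 1.
rows = y[0, 0].sum(dim=1).nonzero().flatten()
print(rows.max() - rows.min() + 1)  # tensor(5), i.e. 2K - 1
```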

Jin-huihuang commented 2 years ago

https://github.com/sail-sg/volo/blob/56983cd9e04da0d1a37ecc46dba970610a8a7497/models/volo.py#L72 How big is the AvgPool2d kernel_size? If the kernel size here is 2, how do we locate the center point of the original k*k windows?
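Not an answer from the authors, but the window geometry with a stride can be checked directly. With stride s, F.unfold produces (H + 2*pad - K)//s + 1 positions per dimension, which for pad = K//2 equals ceil(H/s); that is also the output size of an AvgPool2d with kernel_size = stride and ceil_mode=True (my assumption about the intended pooling config, not a confirmed reading of the repo):

```python
import math

import torch
import torch.nn.functional as F

H, K, s = 7, 3, 2
x = torch.randn(1, 1, H, H)

# Sliding-window positions per dimension: (H + 2*pad - K) // s + 1
u = F.unfold(x, kernel_size=K, padding=K // 2, stride=s)
n = int(math.isqrt(u.shape[-1]))
print(n)  # 4, equal to ceil(H / s): window centers sit at stride-spaced pixels

# An average pool with kernel_size = s, stride = s, ceil_mode=True produces
# the same spatial grid, so each pooled value lines up with one window center.
pooled = F.avg_pool2d(x, kernel_size=s, stride=s, ceil_mode=True)
assert pooled.shape[-1] == n == math.ceil(H / s)
```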