pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[FYI] Bug in R2+1D implementation #1265

Open bjuncek opened 5 years ago

bjuncek commented 5 years ago

I found that there is a lack of clarity in the original R2+1D paper and official code implementation for models utilizing BottleNeck layers, which makes it impossible to transfer weights from large pretrained models in C2 to the models implemented in torchvision.

For background info see the question in their repo.

The fix is very straightforward (change the bottleneck midplanes computation), but the question is whether we should do it, which I suspect should depend on the authors' answer. I'm leaving this here just so that people are aware of it.

fmassa commented 5 years ago

This is related to the discussion in https://github.com/pytorch/vision/pull/1224 and we should follow-up accordingly.

bjuncek commented 5 years ago

All right. For now I won't change the models, since as they are they actually yield better performance. If it's OK with you, I'll just send a PR documenting the issue and then close it?

fmassa commented 5 years ago

Sounds good to me

bjuncek commented 4 years ago

After more thorough examination, and with help from Du et al., it seems there was a conceptual misunderstanding of the equation for calculating the midplanes in their paper. Below I propose a solution (BC-breaking, unfortunately) along with a more thorough discussion:


Background

In order to match the number of parameters of a "vanilla" 3D ResNet, R2+1D models actually increase the number of feature maps in the separated convolutions. This increase is done according to the equation

midplanes = floor( (t * d^2 * N_in * N_out) / (d^2 * N_in + t * N_out) )

where N_in and N_out are the input and output channels of the convolution being factorized, and t = d = 3 are the temporal and spatial kernel sizes.

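For reference, a minimal sketch of that computation in Python (the helper name is mine, but the integer arithmetic matches the torchvision code):

def midplanes(in_planes, out_planes, t=3, d=3):
    # width of the intermediate tensor, chosen so that the (1, d, d) + (t, 1, 1)
    # factorization has roughly as many parameters as a full t x d x d convolution
    return (in_planes * out_planes * t * d * d) // (in_planes * d * d + t * out_planes)

print(midplanes(64, 64))  # 144, e.g. the 64 -> 64 convolutions in the first stage
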
Issue

This equation (or the paper) doesn't specify whether it applies per residual block or per individual convolution. The difference is that the downsampling layer in our case will always have fewer parameters for the simple block (and thus the model will have fewer parameters overall). For example, the original implementation looks like the following (for the downsampling layer only; the remaining layers have the same number of parameters):

Conv2Plus1D(
  (0): Conv3d(128, 288, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(288, 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

while in our case, we'd have

Conv2Plus1D(
  (0): Conv3d(128, 230, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(230, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(230, 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

all the way through.
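
A quick arithmetic check of the two values above (a sketch using the same formula; 230 is what we get when midplanes is computed once per block from the 64 -> 128 transition, 288 is what the reference implementation gets by computing it per convolution from 128 -> 128):

assert (64 * 128 * 3 * 3 * 3) // (64 * 3 * 3 + 3 * 128) == 230    # per block (ours)
assert (128 * 128 * 3 * 3 * 3) // (128 * 3 * 3 + 3 * 128) == 288  # per convolution (reference)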

In the bottleneck layers, on the other hand, we have the opposite issue: we'd always overestimate the number of middle planes, because the output of the separated convolution is always smaller than the output of the whole block. For example

Bottleneck1: 
  3D conv (in: 64, 1,1,1, out: 64) 
  2+1D conv (
    in: 64, 1,3,3, out: 144
    in 144, 3, 1, 1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)
Bottleneck2: 
  3D conv (in: 64*4, 1,1,1, out: 64)
  2+1D conv (
      in: 64, 1,3,3, out: 144
      in 144, 3, 1, 1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)

Whereas according to the block formula one would expect the middle layer of the second bottleneck to be expanded as well, i.e.

Bottleneck1: 
  3D conv (in: 64, 1,1,1, out: 64) 
  2+1D conv (
    in: 64, 1,3,3, out: 144
    in 144, 3, 1, 1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)
Bottleneck2: 
  3D conv (in: 64*4, 1,1,1, out: 64)
  2+1D conv (
      in: 64, 1,3,3, out: 177
      in 177, 3, 1, 1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)
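
The same kind of check for the bottleneck numbers (a sketch; 144 comes from applying the formula to the separated convolution itself, 177 from applying it at the block level to the block's 64*4 -> 64 input transition):

assert (64 * 64 * 3 * 3 * 3) // (64 * 3 * 3 + 3 * 64) == 144     # per convolution (64 -> 64)
assert (256 * 64 * 3 * 3 * 3) // (256 * 3 * 3 + 3 * 64) == 177   # per block (256 -> 64)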

Solution

The solution is simple and should further boost the performance of the PyTorch R2+1D models, but it is BC-breaking.

We could simply replace our Conv2Plus1D module with:

class Conv2Plus1D(nn.Sequential):

    def __init__(self,
                 in_planes,
                 out_planes,
                 stride=1,
                 padding=1):

        # midplanes is computed per convolution (from this conv's own in/out channels)
        # so that the (2+1)D factorization roughly matches the parameter count of a
        # full 3x3x3 convolution
        midplanes = (in_planes * out_planes * 3 * 3 * 3) // (
                in_planes * 3 * 3 + 3 * out_planes)
        super(Conv2Plus1D, self).__init__(
            nn.Conv3d(in_planes, midplanes, kernel_size=(1, 3, 3),
                      stride=(1, stride, stride), padding=(0, padding, padding),
                      bias=False),
            nn.BatchNorm3d(midplanes),
            nn.ReLU(inplace=True),
            nn.Conv3d(midplanes, out_planes, kernel_size=(3, 1, 1),
                      stride=(stride, 1, 1), padding=(padding, 0, 0),
                      bias=False))

This would also allow us to remove midplanes from the block definitions, which would in turn simplify them a bit. Let me know if this is worth the hassle :)
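
As a sanity check, here's a quick sketch (assuming the Conv2Plus1D definition above is in scope) showing that the factorized module roughly matches the parameter count of the full 3x3x3 convolution it replaces:

import torch.nn as nn

factorized = Conv2Plus1D(128, 128)   # picks midplanes = 288
full_3d = nn.Conv3d(128, 128, kernel_size=3, padding=1, bias=False)

print(sum(p.numel() for p in factorized.parameters()))  # 442944 = 442368 conv weights + 576 BatchNorm affine params
print(sum(p.numel() for p in full_3d.parameters()))     # 442368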

cc @fmassa per offline discussion

fmassa commented 4 years ago

Removing midplanes sounds good to me (although it's a BC-breaking change). This is how the original model is meant to be implemented, right?

I didn't quite understand why the proposed solution would change the model at all, though. Wouldn't this just be moving https://github.com/pytorch/vision/blob/17e355f70267e0e01f7c4355a75d46b76f55b5aa/torchvision/models/video/resnet.py#L87 to https://github.com/pytorch/vision/blob/17e355f70267e0e01f7c4355a75d46b76f55b5aa/torchvision/models/video/resnet.py#L45? Doesn't this keep the same behavior as before?

Also, didn't the models have the same number of parameters as the Caffe2-equivalent?

bjuncek commented 4 years ago

Removing midplanes sounds good to me (although it's a BC-breaking change). This is how the original model is meant to be implemented, right?

Yes. Or at least that's how it is implemented in the repo of the original paper.


I didn't quite understand why the proposed solution would change the model at all, though.

Because as it's defined now, you have the same midplanes value for every convolution in every block, but in reality (i.e. in their implementation) that's not the case.

Since an example is worth a thousand words, imagine a fictional R2+1D model that has multiple simple blocks in its first residual layer. mp_ours is the number of midplanes as computed by us, and mp_vmz is the number of midplanes as computed by the reference implementation.

SB1: 
  64->128; (mp_ours = 230, mp_vmz=230)
  128->128 (mp_ours = 230, mp_vmz=288)
SB2:
  128->128 (mp_ours = 288, mp_vmz=288)
  128->128 (mp_ours = 288, mp_vmz=288)
SB3:
  128->128 (mp_ours = 288, mp_vmz=288)
  128->128 (mp_ours = 288, mp_vmz=288)

...
...

We have two places where we pass in_channels and out_channels: one is in the block definition, and the other is in each convolution of the given block. The second convolution of the first simple block actually has in_channels == out_channels, but since we calculate midplanes when we define the block, we underestimate what that number should be for the second conv of the first simple block in each "layer" of the ResNet.

Moving the calculation of midplanes to each convolution definition rather than to each residual block definition fixes this issue.
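
Schematically (a simplified sketch, not the actual torchvision code), the change is only about where the calculation happens:

def mp(in_planes, out_planes):
    return (in_planes * out_planes * 27) // (in_planes * 9 + 3 * out_planes)

# current: midplanes computed once per block and reused for both convolutions
def simple_block_current(inplanes, planes):
    midplanes = mp(inplanes, planes)
    return [(inplanes, midplanes, planes),   # e.g. 64 -> 230 -> 128
            (planes, midplanes, planes)]     # e.g. 128 -> 230 -> 128 (underestimated)

# proposed: each Conv2Plus1D computes midplanes from its own in/out channels
def simple_block_proposed(inplanes, planes):
    return [(inplanes, mp(inplanes, planes), planes),  # 64 -> 230 -> 128
            (planes, mp(planes, planes), planes)]      # 128 -> 288 -> 128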

bjuncek commented 4 years ago

Also, didn't the models have the same number of parameters as the Caffe2-equivalent?

Yes, but only up to the 3rd significant digit (the number of parameters in Caffe2 is given as 33.8M), which allows for quite a lot of variation. This is not a "major" difference in the number of parameters, but it is enough to make the models incompatible with the Caffe2 pretrained weights.
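
For reference, a quick sketch of how to count parameters on the torchvision side (using the 18-layer R2+1D model, since that's the only R2+1D variant torchvision currently exposes; the 33.8M figure above refers to the larger Caffe2 model):

import torchvision

model = torchvision.models.video.r2plus1d_18()
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")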

fmassa commented 4 years ago

Ok, sounds good.

As this is a BC-breaking change, I believe we should do something similar to what we did for MNasNet in https://github.com/pytorch/vision/pull/1224, with the _version attribute, if it's not too much trouble.

bjuncek commented 4 years ago

Yeah, I'll take a stab at that