bjuncek opened this issue 5 years ago
This is related to the discussion in https://github.com/pytorch/vision/pull/1224 and we should follow-up accordingly.
All right - for now I won't change the models, because they actually yield better performance as they are. If it's ok with you, I'll just send a PR documenting the issue, and then close it?
Sounds good to me
After more thorough examination, and with help from Du et al., it seems there is a conceptual misunderstanding of the equation for calculating midplanes in their paper. Here I propose a solution (unfortunately BC-breaking) and a more thorough discussion:
In order to match the number of parameters of a "vanilla" 3D resnet, R2+1D models actually increase the number of feature maps in the separated convolutions. This increase is done according to the equation M = floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out)), with t = d = 3 in our case (this is exactly the midplanes computation in the code below).
This equation (and the paper) doesn't specify whether it applies to the residual blocks or to the individual convolutions. The difference is that, in our case, the downsampling layer of a simple block will always have fewer parameters (and thus the model has fewer parameters overall). For example, the original implementation looks like the following (for the downsampling layer only; the rest have the same number of parameters):
Conv2Plus1D(
  (0): Conv3d(128, 288, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(288, 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)
while in our case, we'd have
Conv2Plus1D(
  (0): Conv3d(128, 230, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(230, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(230, 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)
all the way through.
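The two values above (288 vs 230) follow directly from the midplane formula, evaluated per convolution vs per block. A quick sketch (the formula matches the torchvision midplanes computation; the helper name is mine):

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    # floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out)), with t = d = 3
    return in_planes * out_planes * t * d * d // (in_planes * d * d + t * out_planes)

print(midplanes(128, 128))  # per-conv value for the 128 -> 128 conv: 288
print(midplanes(64, 128))   # block-level value for the 64 -> 128 block: 230
```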
The bottleneck layers, on the other hand, have the opposite issue: there we would always overestimate the number of middle parameters, because the output of the separated convolution is always smaller than the output of the whole block. For example
Bottleneck1:
3D conv (in: 64, 1,1,1, out: 64)
2+1D conv (
in: 64, 1,3,3, out: 144
in 144, 3, 1, 1, out: 64
)
3D conv (in: 64, 1,1,1, out: 64*4)
Bottleneck2:
3D conv (in: 64*4, 1,1,1, out: 64)
2+1D conv (
in: 64, 1,3,3, out: 144
in 144, 3, 1, 1, out: 64
)
3D conv (in: 64, 1,1,1, out: 64*4)
Whereas according to the block-level formula, for the second bottleneck one would expect the middle layer to be expanded as well, i.e.
Bottleneck1:
3D conv (in: 64, 1,1,1, out: 64)
2+1D conv (
in: 64, 1,3,3, out: 144
in 144, 3, 1, 1, out: 64
)
3D conv (in: 64, 1,1,1, out: 64*4)
Bottleneck2:
3D conv (in: 64*4, 1,1,1, out: 64)
2+1D conv (
in: 64, 1,3,3, out: 177
in 177, 3, 1, 1, out: 64
)
3D conv (in: 64, 1,1,1, out: 64*4)
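The 144 vs 177 midplane values in the two variants above can be checked directly against the formula (the helper name is mine):

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    # floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out)), with t = d = 3
    return in_planes * out_planes * t * d * d // (in_planes * d * d + t * out_planes)

# per-conv: the separated convolution itself maps 64 -> 64
print(midplanes(64, 64))      # 144
# block-level: Bottleneck2's input is the expanded 64 * 4 channels
print(midplanes(64 * 4, 64))  # 177
```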
The solution is simple and should further boost the performance of the pytorch R2+1D models, but it is BC-breaking.
We could simply replace our R2+1D module with
class Conv2Plus1D(nn.Sequential):
    def __init__(self, in_planes, out_planes, stride=1, padding=1):
        midplanes = (in_planes * out_planes * 3 * 3 * 3) // (
            in_planes * 3 * 3 + 3 * out_planes)
        super(Conv2Plus1D, self).__init__(
            nn.Conv3d(in_planes, midplanes, kernel_size=(1, 3, 3),
                      stride=(1, stride, stride), padding=(0, padding, padding),
                      bias=False),
            nn.BatchNorm3d(midplanes),
            nn.ReLU(inplace=True),
            nn.Conv3d(midplanes, out_planes, kernel_size=(3, 1, 1),
                      stride=(stride, 1, 1), padding=(padding, 0, 0),
                      bias=False))
This would also allow us to remove midplanes from the blocks, which would in turn simplify them a bit. Let me know if this is worth the hassle :)
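As a sanity check on why this midplanes formula is the right one: it makes the weight count of the separated (2+1)D convolution match, up to floor rounding, the weight count of the full 3x3x3 convolution it replaces. A back-of-the-envelope sketch (helper names are mine):

```python
def midplanes(n_in, n_out, t=3, d=3):
    # floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out))
    return n_in * n_out * t * d * d // (n_in * d * d + t * n_out)

def sep_weights(n_in, n_out, t=3, d=3):
    # 1 x d x d spatial conv followed by t x 1 x 1 temporal conv
    m = midplanes(n_in, n_out, t, d)
    return n_in * m * d * d + m * n_out * t

def full3d_weights(n_in, n_out, t=3, d=3):
    # plain t x d x d 3D convolution
    return n_in * n_out * t * d * d

print(sep_weights(128, 128), full3d_weights(128, 128))  # 442368 442368 (exact match)
print(sep_weights(64, 128), full3d_weights(64, 128))    # 220800 221184 (floor rounding)
```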
cc @fmassa per offline discussion
Removing midplanes sounds good to me (although it's a BC-breaking change). This is how the original model is meant to be implemented, right?
I didn't quite understand why the proposed solution would change the model anyhow though. Wouldn't this be just moving https://github.com/pytorch/vision/blob/17e355f70267e0e01f7c4355a75d46b76f55b5aa/torchvision/models/video/resnet.py#L87 to https://github.com/pytorch/vision/blob/17e355f70267e0e01f7c4355a75d46b76f55b5aa/torchvision/models/video/resnet.py#L45 ? Doesn't this keep the same behavior as before?
Also, didn't the models have the same number of parameters as the Caffe2-equivalent?
> Removing midplanes sounds good to me (although it's a BC-breaking change). This is how the original model is meant to be implemented, right?
Yes. Or at least that's how it is implemented in the repo of the original paper.
> I didn't quite understand why the proposed solution would change the model anyhow though.

Because if you define it as it is now, you have the same midplanes value for every convolution in every block, but in reality (i.e. in their implementation) that's not the case.
Since an example is worth a thousand words, imagine a fictional r25d that has multiple simple blocks in the first residual stage. mp_ours is the number of midplanes as defined by us, and mp_vmz is the number of midplanes as defined by the reference implementation.
SB1:
64->128 (mp_ours = 230, mp_vmz = 230)
128->128 (mp_ours = 230, mp_vmz = 288)
SB2:
128->128 (mp_ours = 288, mp_vmz = 288)
128->128 (mp_ours = 288, mp_vmz = 288)
SB3:
128->128 (mp_ours = 288, mp_vmz = 288)
128->128 (mp_ours = 288, mp_vmz = 288)
...
...
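The table above can be reproduced programmatically; a sketch, with the block/conv shapes hard-coded to match the example:

```python
def midplanes(n_in, n_out, t=3, d=3):
    # floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out))
    return n_in * n_out * t * d * d // (n_in * d * d + t * n_out)

blocks = [(64, 128), (128, 128), (128, 128)]  # SB1, SB2, SB3
for n_in, n_out in blocks:
    mp_ours = midplanes(n_in, n_out)       # computed once, at block definition
    # each simple block contains two (2+1)D convs: n_in -> n_out, n_out -> n_out
    for c_in, c_out in [(n_in, n_out), (n_out, n_out)]:
        mp_vmz = midplanes(c_in, c_out)    # computed per convolution
        print(f"{c_in}->{c_out} (mp_ours = {mp_ours}, mp_vmz = {mp_vmz})")
```

The two values disagree only on the second conv of the first (downsampling) block, which is exactly the discrepancy described above.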
We have two places where we pass in_channels and out_channels: the first is the block definition, and the other is each convolution of the given block. The second convolution of the first simple block actually has in_channels == out_channels, but since we calculate midplanes when we define the block, we underestimate what that number should be for the second conv of the first simple block in each "layer" of the resnet. Moving the midplanes calculation into each convolution definition, rather than the residual block definition, fixes this issue.
> Also, didn't the models have the same number of parameters as the Caffe2-equivalent?
Yes, but only up to the 3rd significant digit (the number of parameters in caffe2 is given as 33.8M), which allows for quite a lot of variation. This is not a "major" difference in the number of parameters, but it is enough to make our models incompatible with the Caffe2 pretrained weights.
Ok, sounds good.
As this is a BC-breaking change, I believe we should do something similar to what we did for MNasNet in https://github.com/pytorch/vision/pull/1224, with the _version attribute, if it's not too much trouble.
Yeah, I'll take a stab at that
I found that there is a lack of clarity in the original R2+1D paper and the official code implementation for models using Bottleneck layers, which makes it impossible to transfer weights from large pretrained C2 models to the models implemented in torchvision.
For background info see the question in their repo.
The fix is very straightforward (change the bottleneck midplanes computation), but the question is whether we should do it, which I suspect should be based on the authors' answer. I'm leaving this here just so that people are aware of it.