mapillary / inplace_abn

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
BSD 3-Clause "New" or "Revised" License

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #6

Closed mingminzhen closed 6 years ago

mingminzhen commented 6 years ago

I am trying to use ABN, InPlaceABN, and InPlaceABNSync, but the following error occurs:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

I am testing on PyTorch 0.2, cuDNN v7, and CUDA 8.

rotabulo commented 6 years ago

@mingminzhen you cannot have two consecutive in-place operations in your computation graph. E.g., if you use InPlaceABN, the preceding and subsequent layers should not be in-place. If you use ABN with an in-place activation function (e.g. ReLU with inplace set to True), then the subsequent layer cannot be in-place. There is a violation of this requirement somewhere in your net.
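For illustration, here is a minimal sketch of how two consecutive in-place ops break the backward pass (written with newer PyTorch tensor syntax than the 0.2 release discussed in this thread):

import torch
import torch.nn.functional as F

# The first in-place op (leaky_relu) saves its output for the backward pass;
# the second in-place op (+=) overwrites that saved output, so backward fails.
x = torch.randn(4, 8, requires_grad=True)
y = F.leaky_relu(x * 2.0, inplace=True)  # first in-place operation
y += 1.0                                 # second consecutive in-place operation
y.sum().backward()                       # RuntimeError: ... modified by an inplace operation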

PkuRainBow commented 6 years ago

@rotabulo Sorry, I want to use InPlaceABN for multi-GPU training. But do you mean that I cannot use subsequent InPlaceABN operations in ResNet like this:

        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = InPlaceABNSync(planes, affine = affine_par, activation="none")
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = InPlaceABNSync(planes, affine = affine_par, activation="none")
        self.downsample = downsample
        self.stride = stride
rotabulo commented 6 years ago

@PkuRainBow the code above does not show how you actually build the computation graph, but if I understand your intention correctly, you are going to apply an in-place ReLU to the output of InPlaceABN. This is not possible, because in-place operations must be followed by non-in-place operations. You should use InPlaceABN with an embedded invertible activation function like leaky ReLU, and not with activation="none", otherwise you lose the memory-saving capabilities.
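For example, a sketch of your block with the activation folded into InPlaceABNSync could look like this (keeping your conv3x3, affine_par, and downsample names; how the residual addition is handled is a separate issue, discussed below):

        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = InPlaceABNSync(planes, affine=affine_par, activation="leaky_relu")  # BN + activation fused in place
        self.conv2 = conv3x3(planes, planes)  # a convolution is not in-place, so following InPlaceABNSync is fine
        self.bn2 = InPlaceABNSync(planes, affine=affine_par, activation="leaky_relu")
        self.downsample = downsample
        self.stride = stride
        # no separate nn.ReLU module: the activation is embedded in InPlaceABNSync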

PkuRainBow commented 6 years ago

@rotabulo I get the same error with nn.ReLU(inplace=False).

PkuRainBow commented 6 years ago

I mean, can I apply InPlaceABNSync in sequential order like this:

        self.conv1 = nn.Sequential(nn.AdaptiveAvgPool2d((1,1)),
                                   nn.Conv2d(features, out_features, kernel_size=1, padding=0, dilation=1, bias=False),
                                   InPlaceABNSync(out_features))
        self.conv2 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=1, padding=0, dilation=1, bias=False),
                                   InPlaceABNSync(out_features))
        self.conv3 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=3, padding=dilations[0], dilation=dilations[0], bias=False),
                                   InPlaceABNSync(out_features))
        self.conv4 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=3, padding=dilations[1], dilation=dilations[1], bias=False),
                                   InPlaceABNSync(out_features))
        self.conv5 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=3, padding=dilations[2], dilation=dilations[2], bias=False),
                                   InPlaceABNSync(out_features))

# later sequence forward computation 
rotabulo commented 6 years ago

@PkuRainBow if you are using the standard resnet module provided by pytorch, you probably still have two consecutive in-place operations due to this line https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L90. Is this the case?

Regarding your last question: you are not showing the computation graph, but if your intention is to compose the different self.conv modules, then that is possible, because the operation that follows InPlaceABN is a convolution.

PkuRainBow commented 6 years ago

@rotabulo Great! So how should I change this line?

out += residual

like this?

out = out + residual

I have fixed all the problems! Great thanks to @rotabulo !

rotabulo commented 6 years ago

@PkuRainBow yes, the solution you suggest resolves the consecutive in-place issue, and yes, you lose some memory savings. We opted for ResNet v2 precisely because it exhibits BN-act patterns everywhere, which is not the case for ResNet v1. The latter can still be optimized in terms of memory, but it requires some more modifications.

mingminzhen commented 6 years ago

@rotabulo What about DenseNet? I tried the official DenseNet implementation: there is no problem with ABN, but the error still occurs with InPlaceABN.

mingminzhen commented 6 years ago

The code for the dense layer is:

class _DenseLayer(nn.Sequential):
    def __init__(self, num_input_features, 
                       growth_rate, 
                       bn_size, 
                       drop_rate,
                       bn_method = 'InPlaceABN'):
        super(_DenseLayer, self).__init__()
        if bn_method == 'InPlaceABN':
            self.add_module('norm_relu.1', ABN.InPlaceABN(num_input_features, activation='relu'))
        elif bn_method == 'InPlaceABNSync': 
            self.add_module('norm_relu_sync.1', ABN.InPlaceABNSync(num_input_features,
                                                                   activation = 'relu'))   
        elif bn_method == 'ABN':
            self.add_module('abn.1',ABN.ABN(num_input_features))
        else:
            self.add_module('norm.1', nn.BatchNorm2d(num_input_features))
            self.add_module('relu.1', nn.ReLU(inplace=True))

        self.add_module('conv.1', nn.Conv2d(num_input_features, bn_size *
                        growth_rate, kernel_size=1, stride=1, bias=False))

        if bn_method == 'InPlaceABN': 
            self.add_module('norm_relu.2', ABN.InPlaceABN(bn_size * growth_rate, 
                                                         activation='relu'))
        elif bn_method == 'InPlaceABNSync': 
            self.add_module('norm_relu_sync.2', ABN.InPlaceABNSync(bn_size * growth_rate,
                                                                   activation = 'relu'))   
        elif bn_method == 'ABN':
            self.add_module('abn.2',ABN.ABN(bn_size * growth_rate))
        else:
            self.add_module('norm.2', nn.BatchNorm2d(bn_size * growth_rate))
            self.add_module('relu.2', nn.ReLU(inplace=True))
        self.add_module('conv.2', nn.Conv2d(bn_size * growth_rate, growth_rate,
                        kernel_size=3, stride=1, padding=1, bias=False))
        self.drop_rate = drop_rate

    def forward(self, x):
        new_features = super(_DenseLayer, self).forward(x)
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return torch.cat([x, new_features], 1)

The _Transition layer is:

class _Transition(nn.Sequential):
    def __init__(self, num_input_features, 
                       num_output_features,
                       bn_method = 'InPlaceABN'):
        super(_Transition, self).__init__()
        if bn_method == 'InPlaceABN':
            self.add_module('norm_relu', ABN.InPlaceABN(num_input_features, activation='relu'))
        elif bn_method == 'InPlaceABNSync': 
            self.add_module('norm_relu_sync', ABN.InPlaceABNSync(num_input_features,
                                                                   activation = 'relu'))   
        elif bn_method == 'ABN':
            self.add_module('abn', ABN.ABN(num_input_features))
        else:
            self.add_module('norm', nn.BatchNorm2d(num_input_features))
            self.add_module('relu', nn.ReLU(inplace=True))

        self.add_module('conv', nn.Conv2d(num_input_features, num_output_features,
                                          kernel_size=1, stride=1, bias=False))
        self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))
rotabulo commented 6 years ago

@mingminzhen You cannot use InPlaceABN with activation="relu": on the one hand it is not implemented, and on the other hand ReLU is not invertible, which is a requirement (see the arXiv paper for technical details). You can use "leaky_relu" or "elu" instead. Please note that we have also recently added a DenseNet model (https://github.com/mapillary/inplace_abn/blob/master/models/densenet.py).
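For example, the InPlaceABN branches of your _DenseLayer would become (a sketch keeping your module names):

        if bn_method == 'InPlaceABN':
            # leaky_relu is invertible, so InPlaceABN can reconstruct its input during backward
            self.add_module('norm_relu.1', ABN.InPlaceABN(num_input_features, activation='leaky_relu'))
        elif bn_method == 'InPlaceABNSync':
            self.add_module('norm_relu_sync.1', ABN.InPlaceABNSync(num_input_features, activation='leaky_relu'))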

mingminzhen commented 6 years ago

That's helpful, thanks.

PkuRainBow commented 6 years ago

@rotabulo It would be great if you could share the ResNet v2 with InPlaceABN. In fact, I do not see any memory savings with InPlaceABN, as I have to give up some of the original in-place operations: the basic block relies heavily on out += residual,

and replacing it with the operation below is indeed expensive:

out = out + residual

Alternatively, how about replacing only part of the BN layers with InPlaceABN and keeping plain BN for the layers that connect to the original in-place operations?

Have you ever run experiments mixing standard BN and InPlaceABN?

ducksoup commented 6 years ago

@PkuRainBow Unfortunately we do not have a pre-trained ResNet v2. However, it is very easy to obtain the correct network structure for a ResNet v2 using our code, as it is a special case of ResNeXt, for which we provide an implementation. In particular, you need to specify groups=1 and base_channels=(64, 64, 256) in the constructor, e.g. to obtain a ResNet152 v2 with InPlaceABN you can use:

from models.resnext import ResNeXt
from modules import InPlaceABN

res_net152 = ResNeXt([3, 8, 36, 3],
                     norm_act=InPlaceABN,
                     groups=1,
                     base_channels=(64, 64, 256))
PkuRainBow commented 6 years ago

@ducksoup @rotabulo Thanks for your great reply! Besides, I am wondering about other ways to avoid the memory cost of out = out + residual in the bottleneck block of ResNet v1 (I do not want to switch to ResNet v2 at the moment, for various reasons). Your paper reports memory savings of about 50%, but after replacing all the BN layers with InPlaceABNSync and changing out += residual to out = out + residual, while keeping everything else the same, I find that no memory is saved, which is very strange. Could you help me?

mingminzhen commented 6 years ago

@rotabulo @ducksoup Could you show your ASPP code for DeepLabV3? I tried to implement it myself, but I still get the error "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation". The previous layer is a conv layer, so there are no two consecutive in-place layers. It's weird. Could you help me?

class ASPP_ConvLayer(nn.Sequential):
    def __init__(self, input_channels, 
                       output_channels,
                       norm_act,
                       kernelSize = 3,
                       Stride     = 1, 
                       droprate   = 0,
                       padding    = 0,
                       dilation   = 1):
        super(ASPP_ConvLayer,self).__init__()
        # self.add_module('norm1',nn.BatchNorm2d(input_channels))
        # self.add_module('relu', nn.ReLU(inplace=True))        
        self.add_module('abn', norm_act(input_channels))
        self.add_module('conv3',nn.Conv2d(input_channels,
                                          output_channels,
                                          kernel_size = kernelSize,
                                          stride      = 1,
                                          padding     = padding,
                                          dilation    = dilation,
                                          bias        = False))
        self.drop_rate = droprate
    def forward(self,x):
        new_features=super(ASPP_ConvLayer,self).forward(x)
        return new_features

class ASPP_Module(nn.Module):
    def __init__(self,input_channels,
                      output_channels,
                      norm_act):
        super(ASPP_Module, self).__init__()
        dilation=[1,6,12,18]
        self.conv2d_list = nn.Sequential()
        #1x1 conv    
        self.conv2d_list.add_module('aspp1', ASPP_ConvLayer(input_channels,input_channels,
                                            kernelSize=1,Stride=1, droprate=0.2,
                                            padding =0, dilation = dilation[0],
                                            norm_act =norm_act))
        #3x3 conv rate=6
        self.conv2d_list.add_module('aspp2', ASPP_ConvLayer(input_channels,input_channels,
                                            kernelSize=3,Stride=1, droprate=0.2,
                                            padding =dilation[1], 
                                            dilation = dilation[1],
                                            norm_act =norm_act))
        #3x3 conv rate=12
        self.conv2d_list.add_module('aspp3', ASPP_ConvLayer(input_channels,input_channels,
                                            kernelSize=3,Stride=1, droprate=0.2,
                                            padding =dilation[2], 
                                            dilation = dilation[2],
                                            norm_act =norm_act))
        #3x3 conv rate=18
        self.conv2d_list.add_module('aspp4', ASPP_ConvLayer(input_channels,input_channels,
                                            kernelSize=3,Stride=1, droprate=0.2,
                                            padding =dilation[3], 
                                            dilation = dilation[3],
                                            norm_act =norm_act))

        self.output_layer=nn.Sequential()
        # self.output_layer.add_module('norm', nn.BatchNorm2d(input_channels*5))
        # self.output_layer.add_module('relu', nn.ReLU(inplace=True))        
                # self.add_module('abn', norm_act(input_channels))
        self.output_layer.add_module('abn', norm_act(input_channels*5))  # the concatenated input has input_channels*5 channels
        self.output_layer.add_module('conv',nn.Conv2d(input_channels*5,
                                                         output_channels,
                                                         kernel_size = 1,
                                                         stride      = 1,
                                                         padding     = 0,
                                                         dilation    = 1,
                                                         bias        = False))
    def forward(self, x):
        # y=self.compress_layer(x)
        y1 = self.conv2d_list.aspp1(x)
        y2 = self.conv2d_list.aspp2(x)
        y3 = self.conv2d_list.aspp3(x)
        y4 = self.conv2d_list.aspp4(x)
        final_out = torch.cat([x,y1,y2,y3,y4],1)
        final_out = self.output_layer(final_out)
        return final_out
rotabulo commented 6 years ago

@mingminzhen I don't see any evident issues in the code snippet you provided. The only potential source of the error might be the x given as input to the forward function. Indeed, the first operation you apply to x is norm_act. If the latter is an in-place operation (e.g. InPlaceABN) and x was generated by an in-place operation, then you would get a runtime error. Can you check what the last operation used to generate x is?

rotabulo commented 6 years ago

@PkuRainBow how do you check the memory savings? Using just nvidia-smi is not the correct way, because PyTorch uses an internal caching allocator, which keeps buffers allocated even if they are not actually in use.

mingminzhen commented 6 years ago

@rotabulo Is it possible that InPlaceABN modifies x in place in the aspp1 branch? Then for aspp2(x), x is no longer the original input, but in fact y1?

rotabulo commented 6 years ago

@mingminzhen Now I see, it's applied in parallel. Yes, this is definitely the issue: aspp1 modifies x, and then aspp2 issues a second in-place operation on x. You should simply apply InPlaceABN once to x in ASPP_Module, not multiple times in each ASPP_ConvLayer.
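A rough sketch of what I mean, reusing your module names (this assumes x itself is produced by a non-in-place layer, and note that the parallel branches now see the normalized tensor, which slightly changes your original design):

import torch
import torch.nn as nn

class ASPP_Module(nn.Module):
    def __init__(self, input_channels, output_channels, norm_act):
        super(ASPP_Module, self).__init__()
        dilation = [1, 6, 12, 18]
        self.abn_in = norm_act(input_channels)  # the single in-place BN + activation applied to x
        self.aspp1 = nn.Conv2d(input_channels, input_channels, kernel_size=1, bias=False)
        self.aspp2 = nn.Conv2d(input_channels, input_channels, kernel_size=3,
                               padding=dilation[1], dilation=dilation[1], bias=False)
        self.aspp3 = nn.Conv2d(input_channels, input_channels, kernel_size=3,
                               padding=dilation[2], dilation=dilation[2], bias=False)
        self.aspp4 = nn.Conv2d(input_channels, input_channels, kernel_size=3,
                               padding=dilation[3], dilation=dilation[3], bias=False)
        self.abn_out = norm_act(input_channels * 5)  # matches the concatenated channels
        self.conv_out = nn.Conv2d(input_channels * 5, output_channels, kernel_size=1, bias=False)

    def forward(self, x):
        x = self.abn_in(x)        # the only in-place operation on x
        y1 = self.aspp1(x)        # the convolutions read x without modifying it
        y2 = self.aspp2(x)
        y3 = self.aspp3(x)
        y4 = self.aspp4(x)
        out = torch.cat([x, y1, y2, y3, y4], 1)  # cat copies its inputs into a fresh tensor
        return self.conv_out(self.abn_out(out))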

PkuRainBow commented 6 years ago

@rotabulo So how should I check the memory savings? Do you mean I can increase the batch size even though nvidia-smi shows the same number?

rotabulo commented 6 years ago

@PkuRainBow exactly: just increase the batch size until you get an out-of-memory exception, and compare how far you get using ABN (the standard setting) vs InPlaceABN.
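Something along these lines (a sketch only: the model, the input shape, and the newer PyTorch tensor API are placeholders/assumptions, not code from this repository):

import torch

def largest_batch_size(model, start=8, shape=(3, 512, 512)):
    # Double the batch size until a forward + backward pass runs out of GPU memory.
    batch = start
    while True:
        try:
            x = torch.randn((batch,) + shape, device="cuda")
            model(x).sum().backward()
            print("batch size %d fits" % batch)
            batch *= 2
        except RuntimeError:  # typically "CUDA out of memory"
            print("out of memory at batch size %d" % batch)
            return batch // 2
        finally:
            model.zero_grad()
            torch.cuda.empty_cache()

Run it once with ABN and once with InPlaceABN and compare the largest batch size that fits.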

mingminzhen commented 6 years ago

@rotabulo If the issue is the parallel structure, then for DenseNet (torch.cat([f(x), x])) or ResNet (f(x) + x), is the second x not the original input? It's similar in your DenseModule:

class DenseModule(nn.Module):
    def __init__(self, in_channels, growth, layers, bottleneck_factor=4, norm_act=ABN, dilation=1):
        super(DenseModule, self).__init__()
        self.in_channels = in_channels
        self.growth = growth
        self.layers = layers

        self.convs1 = nn.ModuleList()
        self.convs3 = nn.ModuleList()
        for i in range(self.layers):
            self.convs1.append(nn.Sequential(OrderedDict([
                ("bn", norm_act(in_channels)),
                ("conv", nn.Conv2d(in_channels, self.growth * bottleneck_factor, 1, bias=False))
            ])))
            self.convs3.append(nn.Sequential(OrderedDict([
                ("bn", norm_act(self.growth * bottleneck_factor)),
                ("conv", nn.Conv2d(self.growth * bottleneck_factor, self.growth, 3, padding=dilation, bias=False,
                                   dilation=dilation))
            ])))
            in_channels += self.growth

    @property
    def out_channels(self):
        return self.in_channels + self.growth * self.layers

    def forward(self, x):
        inputs = [x]
        for i in range(self.layers):
            x = torch.cat(inputs, dim=1)
            x = self.convs1[i](x)
            x = self.convs3[i](x)
            inputs += [x]

        return torch.cat(inputs, dim=1)
mingminzhen commented 6 years ago

Does torch.cat concatenate and copy the inputs, so the originals are left unchanged? So if I want to implement the ASPP structure, I need to copy the input first, right?

mingminzhen commented 6 years ago

I see. In your IdentityResidualBlock, you use shortcut = x.clone().

rotabulo commented 6 years ago

@mingminzhen The issue is not the parallel structure per se, but the incorrect way of using InPlaceABN in the parallel structure. torch.cat is not an in-place operation, and we use x.clone() in IdentityResidualBlock to avoid having two consecutive in-place operations (i.e. add_ and InPlaceABN).
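Schematically, the pattern looks like this (a simplified sketch; the bn/conv member names are placeholders and only loosely follow IdentityResidualBlock):

    def forward(self, x):
        shortcut = x.clone()   # copy x before the in-place BN so the original values survive for the residual add
        out = self.bn1(x)      # InPlaceABN overwrites x in place
        out = self.conv1(out)
        out = self.bn2(out)    # InPlaceABN applied in place to the conv output
        out = self.conv2(out)
        out.add_(shortcut)     # in-place residual add on a conv output, which is safe for backward
        return out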

mingminzhen commented 6 years ago

@rotabulo thanks.

PkuRainBow commented 6 years ago

@rotabulo Thanks! Besides, I am wondering about the difference between ABN and InPlaceABNSync: is the latter expected to save more memory, or does it just add multi-GPU support?

rotabulo commented 6 years ago

@PkuRainBow
ABN is standard BN + activation (no memory savings). InPlaceABN is BN + activation computed in place (with memory savings). InPlaceABNSync is BN + activation computed in place (with memory savings), plus the BN statistics (forward and backward) are computed using data from all GPUs.
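In code the three are drop-in replacements for one another, e.g. (a sketch; the import path follows the ResNeXt example above, and passing an activation to ABN is an assumption):

from modules import ABN, InPlaceABN, InPlaceABNSync

norm_a = ABN(256, activation="leaky_relu")            # standard BN + activation, no memory savings
norm_b = InPlaceABN(256, activation="leaky_relu")     # BN + activation computed in place, saves memory
norm_c = InPlaceABNSync(256, activation="leaky_relu") # as InPlaceABN, with BN statistics synchronized across GPUs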

mingminzhen commented 6 years ago

@rotabulo I tested InPlaceABN on PyTorch 0.3 (torch-0.3.0.post4-cp35-cp35m-linux_x86_64.whl). It seems there is no error. Is the bug you mentioned in the README solved?

rotabulo commented 6 years ago

@mingminzhen We know that the issue has been solved in master, and in the PyTorch issue thread they write that they have not fixed it in v0.3. However, the bug apparently occurs only with Python 2.7 and not with Python 3.x. If you are in the latter case, the code should run without issues.

mingminzhen commented 6 years ago

@rotabulo Another question, about semantic segmentation: do you pre-train the model on MS COCO or other data for the Cityscapes dataset?

PkuRainBow commented 6 years ago

@rotabulo I also ran into a similar bug and went back to PyTorch v0.2, but v0.2 is much slower than v0.3...

rotabulo commented 6 years ago

@PkuRainBow this is unfortunately out of our control. PyTorch v0.3 with Python 2.7 is buggy; PyTorch v0.3 with Python 3.x should work. master with Python 2.7 should also work. Maybe you can give it a try.

John1231983 commented 6 years ago

I think his implementation of ASPP is missing the global average pooling branch. Am I right? By the way, I did not find your implementation of DeepLabV3 in this project. Where is it? Thanks.

dongfengxijian commented 5 years ago

@mingminzhen How did you solve it? Can you tell me?

bonlime commented 5 years ago

For anyone who also has this problem: in ResNet / ResNeXt / SE-ResNet, changing:

out = self.bn3(out)
out += residual

to

out = self.bn3(out) + residual

fixes the problem and doesn't add any overhead, while changing out += residual to out = out + residual gives a huge speed penalty.