@mingminzhen you cannot have two consecutive in-place operations in your computation graph. E.g., if you use InPlaceABN, the preceding and subsequent layers should not be in-place. If you use ABN together with an in-place activation function (e.g. ReLU with inplace set to True), then the subsequent layer cannot be in-place. There is a violation of this requirement somewhere in your net.
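To see why autograd complains, here is a minimal sketch using plain PyTorch ops as a stand-in for InPlaceABN (an op whose backward needs its own output, followed by a second in-place op); this assumes a recent PyTorch, not the 0.2/0.3 versions discussed later in this thread:

import torch

x = torch.randn(4, requires_grad=True)

# Broken: two consecutive in-place ops on the same tensor. relu_'s backward
# needs its own output, but add_ overwrites that output.
y = x * 2
torch.relu_(y)        # in-place op 1 (its backward needs y)
y.add_(1.0)           # in-place op 2 on the same storage
# y.sum().backward()  # -> RuntimeError: ... modified by an inplace operation

# OK: an in-place op followed by an out-of-place op.
z = x * 2
torch.relu_(z)
z = z + 1.0           # new tensor; the output saved by relu_ is untouched
z.sum().backward()    # works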
@rotabulo Sorry, I want to use InPlaceABN for multi-GPU training. But do you mean that I cannot use consecutive InPlaceABN operations in ResNet like this:
self.conv1 = conv3x3(inplanes, planes, stride)
self.bn1 = InPlaceABNSync(planes, affine = affine_par, activation="none")
self.relu = nn.ReLU(inplace=True)
self.conv2 = conv3x3(planes, planes)
self.bn2 = InPlaceABNSync(planes, affine = affine_par, activation="none")
self.downsample = downsample
self.stride = stride
@PkuRainBow the code above does not show how you actually create the computation graph, but if I get your intention correctly, you are going to apply an in-place ReLU to the output of InPlaceABN. This is not possible, because an in-place operation must be followed by non-in-place operations. You should use InPlaceABN with an embedded invertible activation function like leaky_relu, and not with activation="none", otherwise you lose the memory saving capabilities.
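To make this concrete, a minimal sketch contrasting the two layer orderings, assuming the InPlaceABN import and the activation names used later in this thread:

import torch.nn as nn
from modules import InPlaceABN  # repo import, as in the ResNeXt example below

# Problematic: InPlaceABN(activation="none") followed by an in-place ReLU
# gives two consecutive in-place ops, and the separate ReLU also forfeits
# the memory savings.
broken = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    InPlaceABN(64, activation="none"),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)

# Recommended: fold an invertible activation into InPlaceABN, so the next
# operation is a (non-in-place) convolution.
ok = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    InPlaceABN(64, activation="leaky_relu"),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)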
@rotabulo I get the same error with nn.ReLU(inplace=False).
I mean, can I apply InPlaceABNSync in sequential order like this:
self.conv1 = nn.Sequential(AdaptiveAvgPool2d((1, 1)),
                           nn.Conv2d(features, out_features, kernel_size=1, padding=0, dilation=1, bias=False),
                           InPlaceABNSync(out_features))
self.conv2 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=1, padding=0, dilation=1, bias=False),
                           InPlaceABNSync(out_features))
self.conv3 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=3, padding=dilations[0], dilation=dilations[0], bias=False),
                           InPlaceABNSync(out_features))
self.conv4 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=3, padding=dilations[1], dilation=dilations[1], bias=False),
                           InPlaceABNSync(out_features))
self.conv5 = nn.Sequential(nn.Conv2d(features, out_features, kernel_size=3, padding=dilations[2], dilation=dilations[2], bias=False),
                           InPlaceABNSync(out_features))
# later sequence forward computation
@PkuRainBow if you are using the standard resnet module provided by pytorch, you probably still have two consecutive in-place operations due to this line https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L90. Is this the case?
Regarding your last question. You are not showing the computation graph, but if your intention is to compose the different self.conv then that's possible, because the operation that follows InPlaceABN is a Conv.
@rotabulo Great! So how should I change this line?
out += residual
like this?
out = out + residual
I have fixed all the problems! Great thanks to @rotabulo !
@PkuRainBow yes, the solution you suggest solves the consecutive in-place issue, and yes, you lose some memory savings. We indeed opted for ResNet v2 because it exhibits BN-act patterns everywhere, while this is not the case for ResNet v1. ResNet v1 can still be optimized in terms of memory, but it requires some more modifications.
@rotabulo How about DenseNet? I tried the official DenseNet: there is no problem with ABN, but the error still occurs with InPlaceABN.
The code for the dense layer is:
class _DenseLayer(nn.Sequential):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate,
                 bn_method='InPlaceABN'):
        super(_DenseLayer, self).__init__()
        if bn_method == 'InPlaceABN':
            self.add_module('norm_relu.1', ABN.InPlaceABN(num_input_features, activation='relu'))
        elif bn_method == 'InPlaceABNSync':
            self.add_module('norm_relu_sync.1', ABN.InPlaceABNSync(num_input_features, activation='relu'))
        elif bn_method == 'ABN':
            self.add_module('abn.1', ABN.ABN(num_input_features))
        else:
            self.add_module('norm.1', nn.BatchNorm2d(num_input_features))
            self.add_module('relu.1', nn.ReLU(inplace=True))
        self.add_module('conv.1', nn.Conv2d(num_input_features, bn_size * growth_rate,
                                            kernel_size=1, stride=1, bias=False))
        if bn_method == 'InPlaceABN':
            self.add_module('norm_relu.2', ABN.InPlaceABN(bn_size * growth_rate, activation='relu'))
        elif bn_method == 'InPlaceABNSync':
            self.add_module('norm_relu_sync.2', ABN.InPlaceABNSync(bn_size * growth_rate, activation='relu'))
        elif bn_method == 'ABN':
            self.add_module('abn.2', ABN.ABN(bn_size * growth_rate))
        else:
            self.add_module('norm.2', nn.BatchNorm2d(bn_size * growth_rate))
            self.add_module('relu.2', nn.ReLU(inplace=True))
        self.add_module('conv.2', nn.Conv2d(bn_size * growth_rate, growth_rate,
                                            kernel_size=3, stride=1, padding=1, bias=False))
        self.drop_rate = drop_rate

    def forward(self, x):
        new_features = super(_DenseLayer, self).forward(x)
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return torch.cat([x, new_features], 1)
The _Transition layer is:
class _Transition(nn.Sequential):
    def __init__(self, num_input_features, num_output_features,
                 bn_method='InPlaceABN'):
        super(_Transition, self).__init__()
        if bn_method == 'InPlaceABN':
            self.add_module('norm_relu', ABN.InPlaceABN(num_input_features, activation='relu'))
        elif bn_method == 'InPlaceABNSync':
            self.add_module('norm_relu_sync', ABN.InPlaceABNSync(num_input_features, activation='relu'))
        elif bn_method == 'ABN':
            self.add_module('abn', ABN.ABN(num_input_features))
        else:
            self.add_module('norm', nn.BatchNorm2d(num_input_features))
            self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module('conv', nn.Conv2d(num_input_features, num_output_features,
                                          kernel_size=1, stride=1, bias=False))
        self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))
@mingminzhen You cannot use InPlaceABN with activation="relu" because, on the one hand, it's not implemented and, on the other hand, it's not invertible, which is a requirement (see the arXiv paper for technical details). You can use "leaky_relu" or "elu". Please note that we have recently also added the DenseNet model (https://github.com/mapillary/inplace_abn/blob/master/models/densenet.py).
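For instance, the norm-act modules in the dense layer above could be built with an invertible activation like this (a sketch; the InPlaceABN import mirrors the ResNeXt example below, and the InPlaceABNSync export from the same package is an assumption):

from modules import InPlaceABN, InPlaceABNSync  # Sync export assumed alongside InPlaceABN

num_input_features = 64
# activation="relu" is neither implemented nor invertible; use one of these instead:
norm_act = InPlaceABN(num_input_features, activation="leaky_relu")    # single GPU
norm_act_sync = InPlaceABNSync(num_input_features, activation="elu")  # synchronized across GPUs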
It's helpful, thanks.
@rotabulo It would be great if you could share the ResNet v2 with InPlaceABN with me. In fact, I do not see any memory savings with InPlaceABN, as I have to give up some of the original in-place operations.
Since the basic block relies heavily on
out += residual
replacing it with the operation below is indeed expensive!
out = out + residual
Or how about replacing only part of the BNs with InPlaceABN, while keeping the BNs that are connected to the original in-place operations?
Have you ever conducted experiments mixing the basic BN and InPlaceABN?
@PkuRainBow Unfortunately we do not have a pre-trained ResNet v2. However, it is very easy to obtain the correct network structure for a ResNet v2 using our code, as it is a special case of ResNeXt, for which we provide an implementation. In particular, you need to specify groups=1 and base_channels=(64, 64, 256) in the constructor, e.g. to obtain a ResNet152 v2 with InPlaceABN you can use:
from models.resnext import ResNeXt
from modules import InPlaceABN

res_net152 = ResNeXt([3, 8, 36, 3],
                     norm_act=InPlaceABN,
                     groups=1,
                     base_channels=(64, 64, 256))
@ducksoup @rotabulo Thanks for your great reply! Besides, I am wondering about other solutions to avoid the memory cost of out = out + residual in the bottleneck block of ResNet v1 (I do not want to switch to ResNet v2 currently, for various reasons). I see your paper reports that you can save about 50% of the memory cost. I have replaced all the BNs with InPlaceABNSync and changed
out += residual
to
out = out + residual
while keeping everything else the same. But I find that no memory is saved, which is very strange. Could you help me?
@rotabulo @ducksoup Could you show your ASPP code for DeepLabv3? I tried to implement it myself, but there is still the error "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation". In fact, the previous layer is a conv layer, so there are no two consecutive in-place layers. It's weird. Could you help me?
class ASPP_ConvLayer(nn.Sequential):
    def __init__(self, input_channels, output_channels, norm_act,
                 kernelSize=3, Stride=1, droprate=0, padding=0, dilation=1):
        super(ASPP_ConvLayer, self).__init__()
        # self.add_module('norm1', nn.BatchNorm2d(input_channels))
        # self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module('abn', norm_act(input_channels))
        self.add_module('conv3', nn.Conv2d(input_channels, output_channels,
                                           kernel_size=kernelSize, stride=1,
                                           padding=padding, dilation=dilation,
                                           bias=False))
        self.drop_rate = droprate

    def forward(self, x):
        new_features = super(ASPP_ConvLayer, self).forward(x)
        return new_features


class ASPP_Module(nn.Module):
    def __init__(self, input_channels, output_channels, norm_act):
        super(ASPP_Module, self).__init__()
        dilation = [1, 6, 12, 18]
        self.conv2d_list = nn.Sequential()
        # 1x1 conv
        self.conv2d_list.add_module('aspp1', ASPP_ConvLayer(input_channels, input_channels,
                                                            kernelSize=1, Stride=1, droprate=0.2,
                                                            padding=0, dilation=dilation[0],
                                                            norm_act=norm_act))
        # 3x3 conv rate=6
        self.conv2d_list.add_module('aspp2', ASPP_ConvLayer(input_channels, input_channels,
                                                            kernelSize=3, Stride=1, droprate=0.2,
                                                            padding=dilation[1], dilation=dilation[1],
                                                            norm_act=norm_act))
        # 3x3 conv rate=12
        self.conv2d_list.add_module('aspp3', ASPP_ConvLayer(input_channels, input_channels,
                                                            kernelSize=3, Stride=1, droprate=0.2,
                                                            padding=dilation[2], dilation=dilation[2],
                                                            norm_act=norm_act))
        # 3x3 conv rate=18
        self.conv2d_list.add_module('aspp4', ASPP_ConvLayer(input_channels, input_channels,
                                                            kernelSize=3, Stride=1, droprate=0.2,
                                                            padding=dilation[3], dilation=dilation[3],
                                                            norm_act=norm_act))
        self.output_layer = nn.Sequential()
        # self.output_layer.add_module('norm', nn.BatchNorm2d(input_channels*5))
        # self.output_layer.add_module('relu', nn.ReLU(inplace=True))
        # self.add_module('abn', norm_act(input_channels))
        self.output_layer.add_module('abn', norm_act(input_channels))
        self.output_layer.add_module('conv', nn.Conv2d(input_channels * 5, output_channels,
                                                       kernel_size=1, stride=1, padding=0,
                                                       dilation=1, bias=False))

    def forward(self, x):
        # y = self.compress_layer(x)
        y1 = self.conv2d_list.aspp1(x)
        y2 = self.conv2d_list.aspp2(x)
        y3 = self.conv2d_list.aspp3(x)
        y4 = self.conv2d_list.aspp4(x)
        final_out = torch.cat([x, y1, y2, y3, y4], 1)
        final_out = self.output_layer(final_out)
        return final_out
@mingminzhen I don't see evident issues in the code snippet you provided. The only potential source of the error might be the x given as input to the forward function. Indeed, the first operation that you apply to x is norm_act. If the latter is an in-place operation (e.g. InPlaceABN) and x was generated by an in-place operation, then you would get a runtime error. Can you check what the last operation used to generate x is?
@PkuRainBow how do you check the memory savings? Using just nvidia-smi is not the correct way, because PyTorch uses an internal cache, which keeps buffers allocated even if they are not actually used.
@rotabulo Is it possible that InPlaceABN modifies x in place in the aspp1 branch? Then for aspp2(x), x is not the original input; it's in fact y1?
@mingminzhen Now I see, it's applied in parallel. Yes, this is definitely the issue. aspp1 modifies x, and then aspp2 issues a second in-place operation on x. You should simply apply InPlaceABN once to x in ASPP_Module, and not multiple times in each ASPP_ConvLayer.
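A sketch of that fix, keeping the naming of the code above: the norm-act is applied exactly once to x inside the module, and the parallel branches are plain convolutions (the channel bookkeeping is illustrative, not a drop-in replacement; it could be used e.g. as ASPPModuleFixed(2048, 256, norm_act=InPlaceABNSync)):

import torch
import torch.nn as nn

class ASPPModuleFixed(nn.Module):
    """Normalize the input once, then feed the result to the parallel
    branches, so no branch re-applies an in-place op to the same tensor."""
    def __init__(self, input_channels, output_channels, norm_act):
        super(ASPPModuleFixed, self).__init__()
        self.norm_in = norm_act(input_channels)   # applied exactly once to x
        dilations = [1, 6, 12, 18]
        self.branches = nn.ModuleList([
            nn.Conv2d(input_channels, input_channels,
                      kernel_size=1 if d == 1 else 3,
                      padding=0 if d == 1 else d, dilation=d, bias=False)
            for d in dilations
        ])
        self.norm_out = norm_act(input_channels * 5)
        self.project = nn.Conv2d(input_channels * 5, output_channels,
                                 kernel_size=1, bias=False)

    def forward(self, x):
        y = self.norm_in(x)  # the only (possibly in-place) op applied to x
        outs = [y] + [branch(y) for branch in self.branches]
        return self.project(self.norm_out(torch.cat(outs, dim=1)))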
@rotabulo So how should I check the memory savings? Do you mean I can increase the batch size even when nvidia-smi shows the same number?
@PkuRainBow exactly: just increase the batch size until it issues an out-of-memory exception, and compare how far you get using ABN (standard setting) vs InPlaceABN.
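If you want a concrete number rather than only pushing the batch size until an OOM, one option is PyTorch's own allocator counters (a sketch; torch.cuda.max_memory_allocated and reset_max_memory_allocated exist only in later PyTorch releases, not in the 0.2/0.3 versions discussed in this thread, and nvidia-smi will keep reporting the cached pool):

import torch

def peak_memory_mb(model, input_shape, device="cuda"):
    """Run one training-style forward/backward pass and report the peak
    memory actually allocated by PyTorch (not the cached pool), in MB."""
    model = model.to(device).train()
    torch.cuda.reset_max_memory_allocated(device)
    x = torch.randn(*input_shape, device=device)
    model(x).sum().backward()
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)

# Example comparison (the two networks are assumed to be defined elsewhere):
# print(peak_memory_mb(net_with_abn, (8, 3, 512, 512)))
# print(peak_memory_mb(net_with_inplace_abn, (8, 3, 512, 512)))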
@rotabulo If the issue is the parallel structure, then for DenseNet (torch.cat([f(x), x])) or ResNet (f(x) + x), is the second x not the original input?
It's similar in your DenseModule.
class DenseModule(nn.Module):
    def __init__(self, in_channels, growth, layers, bottleneck_factor=4, norm_act=ABN, dilation=1):
        super(DenseModule, self).__init__()
        self.in_channels = in_channels
        self.growth = growth
        self.layers = layers

        self.convs1 = nn.ModuleList()
        self.convs3 = nn.ModuleList()
        for i in range(self.layers):
            self.convs1.append(nn.Sequential(OrderedDict([
                ("bn", norm_act(in_channels)),
                ("conv", nn.Conv2d(in_channels, self.growth * bottleneck_factor, 1, bias=False))
            ])))
            self.convs3.append(nn.Sequential(OrderedDict([
                ("bn", norm_act(self.growth * bottleneck_factor)),
                ("conv", nn.Conv2d(self.growth * bottleneck_factor, self.growth, 3, padding=dilation,
                                   bias=False, dilation=dilation))
            ])))
            in_channels += self.growth

    @property
    def out_channels(self):
        return self.in_channels + self.growth * self.layers

    def forward(self, x):
        inputs = [x]
        for i in range(self.layers):
            x = torch.cat(inputs, dim=1)
            x = self.convs1[i](x)
            x = self.convs3[i](x)
            inputs += [x]
        return torch.cat(inputs, dim=1)
Does torch.cat concatenate and copy the inputs? Then the original ones are unchanged.
So if I want to implement the ASPP structure, I need to copy the input first, right?
I see: in your IdentityResidualBlock, you use shortcut = x.clone().
@mingminzhen The issue is not the parallel structure per se, but the wrong way of using InPlaceABN in the parallel structure. torch.cat is not an in-place operation, and we use x.clone() in IdentityResidualBlock to prevent having two consecutive in-place operations (i.e. add_ and InPlaceABN).
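A minimal sketch of that clone pattern, with norm_act standing in for an in-place InPlaceABN (the block layout is simplified, not the repo's exact IdentityResidualBlock):

import torch.nn as nn

class CloneResidualSketch(nn.Module):
    """Clone the input before the in-place norm-act, so the residual add
    never touches the tensor whose values the in-place op saved for its
    backward pass."""
    def __init__(self, channels, norm_act):
        super(CloneResidualSketch, self).__init__()
        self.bn = norm_act(channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        shortcut = x.clone()          # untouched copy for the residual branch
        out = self.conv(self.bn(x))   # bn may overwrite x in place; that's fine
        # adding in place here is safe, because a convolution's backward pass
        # does not need its own output
        out.add_(shortcut)
        return out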
@rotabulo thanks.
@rotabulo Thanks! Besides, I am wondering about the difference between ABN and InPlaceABNSync: is the latter expected to save more memory, or does it just support multi-GPU training?
@PkuRainBow
ABN is standard BN + activation (no memory savings).
InPlaceABN is BN + activation done in place (with memory savings).
InPlaceABNSync is BN + activation done in place (with memory savings) + computation of BN (fwd + bwd) with data from all the GPUs.
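In terms of usage, the three are interchangeable as the norm_act argument; only the class you pass changes (a sketch reusing the ResNeXt example above, assuming ABN and InPlaceABNSync are exported from modules alongside InPlaceABN):

from models.resnext import ResNeXt
from modules import ABN, InPlaceABN, InPlaceABNSync  # exports assumed

structure = [3, 8, 36, 3]  # the ResNet152-v2-like depths from the example above

net_plain = ResNeXt(structure, norm_act=ABN, groups=1, base_channels=(64, 64, 256))
net_inplace = ResNeXt(structure, norm_act=InPlaceABN, groups=1, base_channels=(64, 64, 256))
net_sync = ResNeXt(structure, norm_act=InPlaceABNSync, groups=1, base_channels=(64, 64, 256))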
@rotabulo I tested InPlaceABN on PyTorch v0.3 (torch-0.3.0.post4-cp35-cp35m-linux_x86_64.whl). It seems there is no error. Is the bug you mentioned in the readme solved?
@mingminzhen We know that the issue has been solved in master, and from the PyTorch issue thread they write that they have not fixed it in v0.3. However, the bug apparently occurs only with Python 2.7 and not with Python 3.x. If you are in the latter case, then the code should run without issues.
@rotabulo Another question is about semantic segmentation. Do you pre-train the model on MS COCO or other data for the Cityscapes dataset?
@rotabulo I also got a similar bug and went back to PyTorch v0.2, but v0.2 is much slower than v0.3.
@PkuRainBow this is unfortunately out of our control. PyTorch v0.3 with Python 2.7 is buggy. PyTorch v0.3 with Python 3.x should work. Also master with Python 2.7 should work. Maybe you can give it a try.
I think his implementation of ASPP is missing global average pooling. Am I right? Btw, I did not find your implementation of DeepLabv3 in this project. Where is it? Thanks.
@mingminzhen How did you solve it? Can you tell me?
For anyone who also has this problem: in ResNet / ResNeXt / SE-ResNet, changing
out = self.bn3(out)
out += residual
to
out = self.bn3(out) + residual
fixes the problem and doesn't add any overhead, while changing out += residual
to out = out + residual
gives a huge speed penalty.
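For context, here is a sketch of a bottleneck built around that idea, with the activations folded into the norm layers (note this drops the separate post-addition ReLU of the original ResNet v1 block; names and layout are illustrative, not the repo's code):

import torch.nn as nn
from modules import InPlaceABN  # as imported earlier in this thread

class BottleneckSketch(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BottleneckSketch, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = InPlaceABN(planes, activation="leaky_relu")
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = InPlaceABN(planes, activation="leaky_relu")
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1,
                               bias=False)
        self.bn3 = InPlaceABN(planes * self.expansion, activation="leaky_relu")
        self.downsample = downsample

    def forward(self, x):
        residual = x if self.downsample is None else self.downsample(x)
        out = self.bn1(self.conv1(x))
        out = self.bn2(self.conv2(out))
        # out-of-place addition right after the in-place bn3: the tensor that
        # bn3 saved for its backward pass is never overwritten
        out = self.bn3(self.conv3(out)) + residual
        return out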
I tried to use ABN, InPlaceABN, and InPlaceABNSync, but some errors occur:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
I tested it on PyTorch 0.2, cuDNN v7, CUDA 8.