gpleiss / efficient_densenet_pytorch

A memory-efficient implementation of DenseNets
MIT License

Segmentation fault (core dumped) error for multiple GPUs #47

Open theonegis opened 5 years ago

theonegis commented 5 years ago

Environment:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp


class _DenseLayer(nn.Module):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate, 1))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.drop_rate = drop_rate

    def forward(self, *inputs):
        # _cat_function_factory is defined elsewhere (not shown in this post)
        cat_function = _cat_function_factory(self.conv1, self.relu1)
        # Only checkpoint when gradients are needed
        if any(feature.requires_grad for feature in inputs):
            output = cp.checkpoint(cat_function, *inputs)
        else:
            output = cat_function(*inputs)
        new_features = self.relu2(self.conv2(output))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return new_features


class _DenseBlock(nn.Module):
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i * growth_rate, growth_rate, bn_size, drop_rate)
            self.add_module(f'denselayer{i + 1}', layer)

    def forward(self, init_features):
        # Each layer receives all previously computed feature maps
        features = [init_features]
        for name, layer in self.named_children():
            new_features = layer(*features)
            features.append(new_features)
        return torch.cat(features, 1)

It runs on a single GPU, but it throws a Segmentation fault (core dumped) error when running on multiple GPUs. What could be causing this issue?

theonegis commented 5 years ago

I noticed a similar issue in the PyTorch repository: "Segfault in dataparallel + checkpoint" (#11732). It seems that it has not been fixed yet.

Ushk commented 5 years ago

@theonegis - I raised the original issue. Just to check whether they are similar problems, can you copy the faulthandler output here, to see if it also points to cp.checkpoint being the issue?

Adding

    import faulthandler
    faulthandler.enable()

at the beginning of your code should output a traceback when your code segfaults.
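For anyone following along, here is a slightly fuller sketch of the same idea. It assumes you would rather have the traceback in a file than on stderr; the file name `segfault_traceback.log` is only a placeholder and not something from this thread.

```python
import faulthandler

# Hypothetical log file name; any writable path works.
# The file object must stay open for as long as faulthandler is enabled.
log_file = open("segfault_traceback.log", "w")

# Install handlers for fatal signals (SIGSEGV, SIGABRT, ...) and dump
# the Python stack of every thread when one of them fires.
faulthandler.enable(file=log_file, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handlers can also be switched on without touching the code, by running the script with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable.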

(Apologies to the PyTorch devs if this is not helpful, I'm just curious)

theonegis commented 5 years ago

@Ushk

Fatal Python error: Segmentation fault

Thread 0x00007f9894e36700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 144 in <listcomp>
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 144 in _flatten_dense_tensors
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 119 in <listcomp>
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 119 in reduce_add_coalesced
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f9855e18700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 252 in _take_tensors
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 118 in reduce_add_coalesced
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f9897637700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/nccl.py", line 14 in is_available
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 76 in reduce_add
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 120 in reduce_add_coalesced
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Current thread 0x00007f989ca35700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f9897e38700 (most recent call first):

Thread 0x00007f990ceb0740 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 93 in backward
  File "/home/theonegis/Developer/DenseNet/experiment.py", line 34 in train_on_epoch
  File "/home/theonegis/Developer/DenseNet/experiment.py", line 100 in _train
  File "/home/theonegis/Developer/DenseNet/experiment.py", line 141 in train
  File "run.py", line 57 in <module>
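A dump this long is tedious to scan by eye. The following is a rough sketch, using only the standard library, of one way to check which threads pass through `torch/utils/checkpoint.py`. The helper names `summarize_threads` and `threads_through` are made up for illustration, and the sample text is a shortened stand-in for the dump above with paths abbreviated.

```python
import re

# One faulthandler frame line, e.g.:  File "/path/file.py", line 45 in backward
FRAME_RE = re.compile(r'^\s*File "([^"]+)", line (\d+) in (.+)$')
# A thread header, e.g.:  Thread 0x00007f98... (most recent call first):
THREAD_RE = re.compile(r'^(?:Current thread|Thread) (0x[0-9a-fA-F]+)')


def summarize_threads(dump):
    """Return {thread_id: [(path, line, func), ...]} parsed from a dump."""
    threads = {}
    current = None
    for raw in dump.splitlines():
        header = THREAD_RE.match(raw)
        if header:
            current = header.group(1)
            threads[current] = []
            continue
        frame = FRAME_RE.match(raw)
        if frame and current is not None:
            threads[current].append(
                (frame.group(1), int(frame.group(2)), frame.group(3)))
    return threads


def threads_through(threads, suffix):
    """List thread ids whose stack has a frame in a file ending with suffix."""
    return [tid for tid, frames in threads.items()
            if any(path.endswith(suffix) for path, _, _ in frames)]


# Shortened, illustrative sample (not the full dump from this thread).
sample = '''Fatal Python error: Segmentation fault

Thread 0x00007f9894e36700 (most recent call first):
  File "site-packages/torch/_utils.py", line 144 in _flatten_dense_tensors
  File "site-packages/torch/utils/checkpoint.py", line 45 in backward

Thread 0x00007f9897e38700 (most recent call first):
'''

threads = summarize_threads(sample)
print(threads_through(threads, "torch/utils/checkpoint.py"))
```

Pasting the full faulthandler output into `sample` should list every thread that has a `checkpoint.py` frame on its stack.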

gpleiss commented 5 years ago

@theonegis what happens if you upgrade to the latest stable version of PyTorch (0.4.1)?

theonegis commented 5 years ago

@gpleiss Still the same problem.

Ushk commented 5 years ago

Yeah, just an FYI, I'm on 0.4.1 as well. And I can see that yours is also a checkpoint issue. What happens if you checkpoint -all- of your layers, @theonegis?
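One possible reading of "checkpoint all of your layers" is to put the second convolution inside the checkpointed function too, instead of only `conv1` and `relu1`. The sketch below is an assumption about what that could look like, not code from this thread: the class name `FullyCheckpointedLayer` is hypothetical, dropout is left out for brevity, and the `use_reentrant` argument exists only in PyTorch releases much newer than the 0.4.1 discussed here (on 0.4.1 it would have to be dropped). Whether this avoids the segfault is not established anywhere in the thread.

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint as cp


class FullyCheckpointedLayer(nn.Module):
    """Hypothetical variant of _DenseLayer: the whole forward is checkpointed."""

    def __init__(self, num_input_features, growth_rate, bn_size):
        super().__init__()
        self.conv1 = nn.Conv2d(num_input_features, bn_size * growth_rate, 1)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1)
        self.relu2 = nn.ReLU(inplace=True)

    def _forward_impl(self, *inputs):
        # Concatenate all earlier feature maps along the channel dimension
        x = torch.cat(inputs, 1)
        x = self.relu1(self.conv1(x))
        return self.relu2(self.conv2(x))

    def forward(self, *inputs):
        # Checkpoint only when at least one input needs gradients
        if any(t.requires_grad for t in inputs):
            # use_reentrant=True mirrors the original checkpoint behaviour;
            # the keyword is an assumption about running on a recent PyTorch.
            return cp.checkpoint(self._forward_impl, *inputs, use_reentrant=True)
        return self._forward_impl(*inputs)


# Small CPU check: two 4-channel inputs give 8 input channels in total.
layer = FullyCheckpointedLayer(num_input_features=8, growth_rate=4, bn_size=2)
a = torch.randn(2, 4, 5, 5, requires_grad=True)
b = torch.randn(2, 4, 5, 5, requires_grad=True)
out = layer(a, b)
out.sum().backward()
print(out.shape, a.grad is not None)
```

The trade-off is that every activation inside the layer is recomputed during the backward pass, so this saves more memory than the partial version at the cost of extra compute.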