Open theonegis opened 5 years ago
I noticed a similar issue in the PyTorch repository: "Segfault in dataparallel + checkpoint" (#11732). It seems that it has not been fixed yet.
@theonegis - I raised the original issue. Just to check whether they are similar problems, can you copy the faulthandler output here, to see if it also points to cp.checkpoint being the issue?
Adding `import faulthandler` followed by `faulthandler.enable()` at the beginning of your code should output a traceback when your code segfaults.
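For anyone else trying this, here is a minimal sketch of the faulthandler suggestion. It assumes you want the traceback in a log file rather than on stderr; the temporary log file is my own choice for illustration, not something from the original report.

```python
import faulthandler
import tempfile

# Hypothetical log destination: by default faulthandler writes to sys.stderr.
# The file object must stay open for as long as faulthandler is enabled.
log_file = tempfile.NamedTemporaryFile("w", suffix=".log", delete=False)

# Dump the Python traceback of every thread if the process receives a
# fatal signal such as SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable(file=log_file, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
print("tracebacks will be written to:", log_file.name)
```

If editing the script is inconvenient, running it with `python -X faulthandler run.py` or setting the `PYTHONFAULTHANDLER` environment variable should have the same effect, with output going to stderr.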
(Apologies to the PyTorch devs if this is not helpful, I'm just curious)
@Ushk
Fatal Python error: Segmentation fault
Thread 0x00007f9894e36700 (most recent call first):
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 144 in <listcomp>
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 144 in _flatten_dense_tensors
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 119 in <listcomp>
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 119 in reduce_add_coalesced
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
Thread 0x00007f9855e18700 (most recent call first):
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 252 in _take_tensors
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 118 in reduce_add_coalesced
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
Thread 0x00007f9897637700 (most recent call first):
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/nccl.py", line 14 in is_available
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 76 in reduce_add
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 120 in reduce_add_coalesced
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
Current thread 0x00007f989ca35700 (most recent call first):
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
Thread 0x00007f9897e38700 (most recent call first):
Thread 0x00007f990ceb0740 (most recent call first):
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 93 in backward
File "/home/theonegis/Developer/DenseNet/experiment.py", line 34 in train_on_epoch
File "/home/theonegis/Developer/DenseNet/experiment.py", line 100 in _train
File "/home/theonegis/Developer/DenseNet/experiment.py", line 141 in train
File "run.py", line 57 in <module>
@theonegis what happens if you upgrade to the latest stable version of PyTorch (0.4.1)?
@gpleiss Still the same problem.
Yeah, just an FYI, I'm on 0.4.1 as well, and I can see that yours is also a checkpoint issue. What happens if you checkpoint *all* of your layers, @theonegis?
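To make the question concrete, this is roughly what I mean by checkpointing every layer. It is only a sketch: `CheckpointedStack` and the toy layers are hypothetical names I made up, it runs on a single device without `DataParallel`, and it is not expected to reproduce the segfault by itself.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedStack(nn.Module):
    """Hypothetical wrapper that runs every layer it holds under checkpoint."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # Intermediate activations inside `layer` are not kept;
            # they are recomputed when backward reaches this segment.
            x = checkpoint(layer, x)
        return x


# Toy layers for illustration only (not the DenseNet from this issue).
layers = [nn.Sequential(nn.Conv2d(4, 4, 3, padding=1), nn.ReLU()) for _ in range(3)]
model = CheckpointedStack(layers)

# The input is marked as requiring grad so that gradients flow through
# the checkpointed segments back to the parameters.
x = torch.randn(2, 4, 8, 8, requires_grad=True)
out = model(x)
out.sum().backward()
print(out.shape, x.grad.shape)
```

Recent PyTorch releases may warn that `use_reentrant` should be passed explicitly to `checkpoint`; that argument does not exist in 0.4.1, so it is left out here.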
Environment:
    import torch.nn as nn


    class _DenseLayer(nn.Module):
        def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
            super(_DenseLayer, self).__init__()
            # 1x1 bottleneck convolution producing bn_size * growth_rate feature maps
            self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate, 1))
            self.add_module('relu1', nn.ReLU(inplace=True))
            # 3x3 convolution producing growth_rate new feature maps
            self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1))
            self.add_module('relu2', nn.ReLU(inplace=True))
            self.drop_rate = drop_rate


    class _DenseBlock(nn.Module):
        def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
            super(_DenseBlock, self).__init__()
            for i in range(num_layers):
                # Each layer also receives the outputs of all previous layers in the block
                layer = _DenseLayer(num_input_features + i * growth_rate, growth_rate, bn_size, drop_rate)
                self.add_module(f'denselayer{i + 1}', layer)