Training with nn.GRU on multi-gpu causes CUDNN_STATUS_EXECUTION_FAILED

hwchong commented 7 years ago

I've been trying to train a recurrent neural network using nn.GRU on a multi-GPU setup, and I'm randomly getting crashes caused by CuDNN.

I'm using PyTorch 0.2 on Linux running in an nvidia-docker container with the latest nvidia/cuda image.

This is the error message that comes up:

CuDNNError                                Traceback (most recent call last)
in ()
      7 optimizer.zero_grad()
      8 model.module.hidden = model.module.init_hidden()
----> 9 outputs = model(inputs)
     10 outputs = outputs.view((-1))
     11 loss = criterion(outputs, labels)

/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
     58             return self.module(*inputs[0], **kwargs[0])
     59         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
---> 60         outputs = self.parallel_apply(replicas, inputs, kwargs)
     61         return self.gather(outputs, self.output_device)
     62

/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
     68
     69     def parallel_apply(self, replicas, inputs, kwargs):
---> 70         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
     71
     72     def gather(self, outputs, output_device):

/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     65         output = results[i]
     66         if isinstance(output, Exception):
---> 67             raise output
     68         outputs.append(output)
     69     return outputs

/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py in _worker(i, module, input, kwargs, results, lock, device)
     40         try:
     41             with torch.cuda.device(device):
---> 42                 output = module(*input, **kwargs)
     43             with lock:
     44                 results[i] = output

/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

in forward(self, input)
     15     def forward(self, input):
     16         #self.hidden = self.init_hidden()
---> 17         gru_out, self.hidden = self.gru(input, self.hidden)
     18         x = gru_out[:, -1]
     19         x = self.dropout(x)

/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

/usr/local/lib/python3.5/dist-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
    160             flat_weight=flat_weight
    161         )
--> 162         output, hidden = func(input, self.all_weights, hx)
    163         if is_packed:
    164             output = PackedSequence(output, batch_sizes)

/usr/local/lib/python3.5/dist-packages/torch/nn/_functions/rnn.py in forward(input, *fargs, **fkwargs)
    349         else:
    350             func = AutogradRNN(*args, **kwargs)
--> 351         return func(input, *fargs, **fkwargs)
    352
    353     return forward

/usr/local/lib/python3.5/dist-packages/torch/autograd/function.py in _do_forward(self, *input)
    282         self._nested_input = input
    283         flat_input = tuple(_iter_variables(input))
--> 284         flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
    285         nested_output = self._nested_output
    286         nested_variables = _unflatten(flat_output, self._nested_output)

/usr/local/lib/python3.5/dist-packages/torch/autograd/function.py in forward(self, *args)
    304     def forward(self, *args):
    305         nested_tensors = _map_variable_tensor(self._nested_input)
--> 306         result = self.forward_extended(*nested_tensors)
    307         del self._nested_input
    308         self._nested_output = result

/usr/local/lib/python3.5/dist-packages/torch/nn/_functions/rnn.py in forward_extended(self, input, weight, hx)
    291             hy = tuple(h.new() for h in hx)
    292
--> 293         cudnn.rnn.forward(self, input, hx, weight, output, hy)
    294
    295         self.save_for_backward(input, hx, weight, output)

/usr/local/lib/python3.5/dist-packages/torch/backends/cudnn/rnn.py in forward(fn, input, hx, weight, output, hy)
    303                 fn.cy_desc, ctypes.c_void_p(cy.data_ptr()) if cx is not None else None,
    304                 ctypes.c_void_p(workspace.data_ptr()), workspace.size(0),
--> 305                 ctypes.c_void_p(fn.reserve.data_ptr()), fn.reserve.size(0)
    306             ))
    307         else: # inference

/usr/local/lib/python3.5/dist-packages/torch/backends/cudnn/__init__.py in check_error(status)
    253 def check_error(status):
    254     if status is not 0:
--> 255         raise CuDNNError(status)
    256
    257

CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'

santisy commented 7 years ago

I think the problem lies in the scatter_kwargs function used by data_parallel. It seems that scatter_kwargs only splits variables along the first dimension, whereas the hidden_state input to RNNs must have the shape (num_layers * num_directions, batch, hidden_size), so the split lands on the layer dimension instead of the batch dimension. One temporary solution is to swap the dimensions of hidden_state by wrapping GRU.
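A minimal sketch of that workaround, not code taken from this thread: the wrapper name `BatchFirstHiddenGRU` and the sizes used are hypothetical, and the code uses the current tensor API rather than the PyTorch 0.2 one from the report. It assumes the hidden state is passed into `forward` as an argument (so that DataParallel scatters it) and keeps the state batch-first outside the wrapper, transposing it only around the call to nn.GRU.

```python
import torch
import torch.nn as nn


class BatchFirstHiddenGRU(nn.Module):
    """Hypothetical wrapper: takes and returns the hidden state as
    (batch, num_layers * num_directions, hidden_size)."""

    def __init__(self, *args, **kwargs):
        super().__init__()
        self.gru = nn.GRU(*args, **kwargs)

    def forward(self, inputs, hidden=None):
        if hidden is not None:
            # nn.GRU expects (num_layers * num_directions, batch, hidden_size).
            hidden = hidden.transpose(0, 1).contiguous()
        output, hidden = self.gru(inputs, hidden)
        # Return the state batch-first so a split or gather on dim 0
        # acts on the batch dimension.
        return output, hidden.transpose(0, 1).contiguous()


# Illustrative sizes only; runs on CPU.
gru = BatchFirstHiddenGRU(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
inputs = torch.randn(4, 5, 8)    # (batch, seq_len, input_size)
hidden = torch.zeros(4, 2, 16)   # (batch, num_layers, hidden_size)
output, new_hidden = gru(inputs, hidden)
```

With this layout, both `inputs` and `hidden` carry the batch in dimension 0, so splitting along the first dimension gives each replica a matching slice of the two tensors.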

soumith commented 4 years ago

should be good by now

pytorch / pytorch

Training with nn.GRU on multi-gpu causes CUDNN_STATUS_EXECUTION_FAILED #2418