kaniblu / pytorch-skipthoughts

cudnn error? #1

Open · sanyam5 opened 6 years ago

sanyam5 commented 6 years ago

Hey, I am getting this error when I try to train:

python -m torchst.train --config train.yml
0it [00:00, ?it/s]/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/model.py:308: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greately increasing memory usage. To compact weights again call flatten_parameters().
/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/model.py:306: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greately increasing memory usage. To compact weights again call flatten_parameters().

Traceback (most recent call last):
  File "/home/sanyam/miniconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sanyam/miniconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/train.py", line 708, in <module>
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/train.py", line 702, in main
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/train.py", line 410, in train
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/train.py", line 386, in step_train
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/train.py", line 356, in step
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/train.py", line 348, in forward
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 105, in data_parallel
    outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/train.py", line 485, in forward
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/model.py", line 401, in forward
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/model.py", line 372, in _decode
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/pytorch_skipthoughts-0.4.4-py3.6.egg/torchst/model.py", line 306, in _run_rnn_packed
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 235, in forward
    fn.rnn_desc = init_rnn_descriptor(fn, handle)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 42, in init_rnn_descriptor
    cudnn.DropoutDescriptor(handle, dropout_p, fn.dropout_seed)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 184, in __init__
    self._set(dropout, seed)
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 209, in _set
    ctypes.c_ulonglong(seed),
  File "/home/sanyam/miniconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 255, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'
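
As an aside, the two UserWarning lines at the top seem unrelated to the crash: they only say that the replicated RNN's weights no longer form one contiguous block of memory. A minimal sketch of the usual remedy, using a plain nn.LSTM stand-in rather than torchst's actual module:

import torch
import torch.nn as nn
from torch.autograd import Variable  # needed on the 0.2-era PyTorch in this traceback

rnn = nn.LSTM(input_size=300, hidden_size=2400).cuda()
# Once an RNN has been moved or replicated (e.g. by DataParallel), its weight
# tensors may stop being one contiguous chunk of memory; compacting them
# silences the warning and avoids a copy on every cuDNN call. Under
# DataParallel this has to run inside forward(), once per replica.
rnn.flatten_parameters()
output, (h, c) = rnn(Variable(torch.randn(30, 8, 300).cuda()))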

train.yml

name: skipthoughts
data-path: ./bookscorpus_small.txt
vocab-path: ./vocab.pkl
save-dir: ./saver/
gpus: [0, 1, 2, 3]
previews: 10
wordembed-type: none
wordembed-path: none
fasttext-path: null
wordembed-freeze: false
epochs: 10
batch-size: 256
omit-prob: 0.05
swap-prob: 0.05
val-period: 100
save-period: 1000
max-len: 30
visdom-host: localhost
visdom-port: 8097
visdom-buffer-size: 10
encoder-cell: lstm
decoder-cell: gru
before: 1
after: 1
predict-self: false
word-dim: 300
hidden-dim: 2400
layers: 1
# bidirectional: true
dropout-prob: 0.05

Do you know why this could be happening?
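
One way to narrow this down is to exercise the cuDNN RNN path on each configured device in isolation; an occupied or out-of-memory card will typically raise a similar CUDNN_STATUS error here. A rough sketch, with the device list copied from the gpus entry above:

import torch
import torch.nn as nn
from torch.autograd import Variable

for dev in [0, 1, 2, 3]:  # the "gpus" list from train.yml
    # A tiny GRU forward pass is enough to trigger cuDNN's RNN/dropout
    # descriptor setup, the code path that fails in the traceback above.
    rnn = nn.GRU(input_size=4, hidden_size=4).cuda(dev)
    x = Variable(torch.randn(5, 2, 4).cuda(dev))  # (seq_len, batch, input_size)
    rnn(x)
    print("GPU %d: OK" % dev)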

sanyam5 commented 6 years ago

It turns out that not all 4 of my GPUs were free. I removed the occupied GPUs from train.yml and now it's working.
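
For anyone hitting the same thing, a small helper sketch (my own, not part of torchst) that asks nvidia-smi which cards are idle, so only free device indices go into the gpus list:

import subprocess

def free_gpus(threshold_mib=100):
    # Query per-GPU memory usage; any device using more than the threshold
    # is treated as occupied. These are standard nvidia-smi query flags.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    free = []
    for line in out.strip().splitlines():
        index, used = (int(v) for v in line.split(","))
        if used < threshold_mib:
            free.append(index)
    return free

print(free_gpus())  # e.g. [0, 2] if GPUs 1 and 3 are busy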

However, I noticed that you need to include GPU #0; otherwise PyTorch reports that the variables are on different GPUs. I suspect that PyTorch places some variables on GPU #0 by default.
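
That matches how torch.nn.parallel.data_parallel behaves: the master copy of the parameters must live on device_ids[0], which is also where outputs are gathered by default, and the model here is presumably created on the default device (GPU 0). A workaround sketch, not how torchst itself selects devices: CUDA_VISIBLE_DEVICES renumbers the visible cards from 0, so logical GPU #0 can be any free physical GPU. Device numbers below are only an example.

# Shell form: expose only the free physical cards, which become logical 0..2:
#   CUDA_VISIBLE_DEVICES=1,2,3 python -m torchst.train --config train.yml
# (with gpus: [0, 1, 2] in train.yml). The same remapping from Python:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"  # must run before importing torch

import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Linear(8, 8).cuda(0)  # logical GPU 0 is physical GPU 1 here
out = nn.parallel.data_parallel(
    model,
    Variable(torch.randn(6, 8).cuda(0)),  # gathered on device_ids[0] by default
    device_ids=[0, 1, 2])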

kaniblu commented 6 years ago

You are correct. I have made some fixes since then, but I haven't pushed them to the repo yet. I will hopefully do so soon.