keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

ConvLSTM2D with Theano libgpuarray and cuDNN crashes #4305

Closed carlthome closed 7 years ago

carlthome commented 7 years ago

KeyError: ('The following error happened while compiling the node', forall_inplace,cpu,scan_fn}(Shape_i{1}.0, InplaceGpuDimShuffle{0,1,4,2,3}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, Shape_i{1}.0, InplaceGpuDimShuffle{3,2,0,1}.0, InplaceGpuDimShuffle{x,x,x,0}.0, InplaceGpuDimShuffle{3,2,0,1}.0, InplaceGpuDimShuffle{3,2,0,1}.0, InplaceGpuDimShuffle{x,x,x,0}.0, InplaceGpuDimShuffle{3,2,0,1}.0, InplaceGpuDimShuffle{3,2,0,1}.0, InplaceGpuDimShuffle{x,x,x,0}.0, InplaceGpuDimShuffle{3,2,0,1}.0, InplaceGpuDimShuffle{3,2,0,1}.0, InplaceGpuDimShuffle{x,x,x,0}.0, InplaceGpuDimShuffle{3,2,0,1}.0), '\n', 'The following error happened while compiling the node', GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0}), '\n', 'cudnn_handle')

I assume this is related to having convolution operations inside a scan. @jeammimi, @nouiz, @fchollet, what do you think? Can you get the conv_lstm.py example working with THEANO_FLAGS=device=cuda and cuDNN?

nouiz commented 7 years ago

Make sure to use the Theano dev version. If you were already using it, update it again. I recently fixed a crash that could be the cause of this problem.

If the problem persists, give the full error message.

carlthome commented 7 years ago

Unfortunately, the problem persists with the latest Theano (installed from GitHub master just now). Please try python keras/examples/conv_lstm.py. Do you get the same error?

pip show theano:

Name: Theano
Version: 0.9.0.dev4
Summary: Optimizing compiler for evaluating mathematical expressions on CPUs and GPUs.
Home-page: http://deeplearning.net/software/theano/
Author: LISA laboratory, University of Montreal
Author-email: theano-dev@googlegroups.com
License: BSD
Location: /home/carl/anaconda3/lib/python3.5/site-packages
Requires: numpy, scipy, six

python conv_lstm.py:

Using Theano backend.
Mapped name None to device cuda: TITAN X (Pascal)
Using cuDNN version 5105 on context None
Traceback (most recent call last):
  File "conv_lstm.py", line 104, in <module>
    nb_epoch=300, validation_split=0.05)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/keras/models.py", line 640, in fit
    sample_weight=sample_weight)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1084, in fit
    self._make_test_function()
  File "/home/carl/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 739, in _make_test_function
    **self._function_kwargs)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 821, in function
    return Function(inputs, outputs, updates=updates, **kwargs)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 807, in __init__
    **kwargs)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/compile/function.py", line 326, in function
    output_keys=output_keys)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/compile/pfunc.py", line 486, in pfunc
    output_keys=output_keys)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 1784, in orig_function
    defaults)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 1651, in create
    input_storage=input_storage_lists, storage_map=storage_map)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gof/link.py", line 699, in make_thunk
    storage_map=storage_map)[:3]
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gof/vm.py", line 1057, in make_all
    impl=impl))
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gof/op.py", line 924, in make_thunk
    no_recycling)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gof/op.py", line 824, in make_c_thunk
    no_recycling=e_no_recycling)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gof/cc.py", line 563, in accept
    self.fetch_variables()
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gof/cc.py", line 589, in fetch_variables
    params = node.run_params()
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gof/graph.py", line 129, in run_params
    return self.op.get_params(self)
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gpuarray/dnn.py", line 215, in get_params
    ptr = get_prop(self.dnn_context(node), 'cudnn_handle').value
  File "/home/carl/anaconda3/lib/python3.5/site-packages/theano/gpuarray/type.py", line 114, in get_prop
    return _get_props(name)[k]
KeyError: ('The following error happened while compiling the node', GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0}), '\n', 'cudnn_handle')

nouiz commented 7 years ago

What Theano and Keras flags do you use?

carlthome commented 7 years ago

Pretty standard:

.theanorc:

[global]
device = cuda
floatX = float32

keras.json:

{
    "floatx": "float32",
    "backend": "theano",
    "image_dim_ordering": "th",
    "epsilon": 1e-07
}

Same error with image_dim_ordering='tf' as well.
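
For anyone trying to reproduce this without editing a config file, here is a minimal sketch of passing the same two Theano settings through the THEANO_FLAGS environment variable from Python. The helper name build_theano_flags is made up for illustration and is not part of Theano or Keras; the only assumption about Theano is that the variable has to be set before theano is imported.

```python
import os


def build_theano_flags(options):
    """Join (name, value) pairs into the comma-separated THEANO_FLAGS format."""
    return ",".join("{}={}".format(name, value) for name, value in options)


# Same settings as the [global] section above (hypothetical helper, see lead-in).
flags = build_theano_flags([("device", "cuda"), ("floatX", "float32")])

# Assumption: this runs before `import theano`, otherwise it has no effect.
os.environ["THEANO_FLAGS"] = flags
print(flags)
```

This should be equivalent to launching the example with THEANO_FLAGS=device=cuda,floatX=float32 on the command line.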

carlthome commented 7 years ago

Related https://github.com/Theano/Theano/issues/5123

carlthome commented 7 years ago

Annoyingly, downgrading (e.g. pip install git+git://github.com/theano/theano@1dabf8540db2f5eda4a80c2798848c12836642c1) is the best fix I could come up with at the moment. :cry:

abergeron commented 7 years ago

I've tried to reproduce this, but I get a different error:

$ THEANO_FLAGS=floatX=float32,device=cuda0 python conv_lstm.py 
Using Theano backend.
Mapped name None to device cuda0: GeForce GTX 750 Ti
PCI Bus ID: 0000:07:00.0
Using cuDNN version 5105 on context None
/home/anakha/ext/keras/keras/layers/convolutional_recurrent.py:279: UserWarning: Be carefull if used with convolution3D layers:
th in convolution 3D corresponds to (samples, channels, conv_dim1, conv_dim2,conv_dim3)
while for this network it corresponds to: (samples, time, channels, rows, cols)
  warnings.warn('Be carefull if used with convolution3D layers:\n'
/home/anakha/ext/keras/keras/layers/convolutional_recurrent.py:279: UserWarning: Be carefull if used with convolution3D layers:
th in convolution 3D corresponds to (samples, channels, conv_dim1, conv_dim2,conv_dim3)
while for this network it corresponds to: (samples, time, channels, rows, cols)
  warnings.warn('Be carefull if used with convolution3D layers:\n'
Traceback (most recent call last):
  File "conv_lstm.py", line 104, in <module>
    nb_epoch=300, validation_split=0.05)
  File "/home/anakha/ext/keras/keras/models.py", line 642, in fit
    sample_weight=sample_weight)
  File "/home/anakha/ext/keras/keras/engine/training.py", line 1062, in fit
    batch_size=batch_size)
  File "/home/anakha/ext/keras/keras/engine/training.py", line 1000, in _standardize_user_data
    check_loss_and_target_compatibility(y, self.loss_functions, self.internal_output_shapes)
  File "/home/anakha/ext/keras/keras/engine/training.py", line 215, in check_loss_and_target_compatibility
    ' while using as loss `' + loss.__name__ + '`. '
Exception: A target array with shape (1000, 15, 40, 40, 1) was passed for an output of shape (None, None, 40, 40, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output

carlthome commented 7 years ago

@kilotaras, https://github.com/fchollet/keras/commit/6b04add93209f557e265bbd04ab34d5491d463f0 probably introduced @abergeron's problem with examples/conv_lstm.py

Shouldn't the shape checking ignore None? None is used for allowing variable lengths during shape inference.
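
To make the suggestion concrete, here is a rough sketch of a shape check that treats None in the output shape as "any size". This is not the actual Keras code in training.py; the function name shapes_compatible is hypothetical and only illustrates the idea.

```python
def shapes_compatible(target_shape, output_shape):
    """Return True if target_shape matches output_shape.

    A None entry in output_shape marks a variable-length dimension and
    matches any size in the target (hypothetical helper, not Keras code).
    """
    if len(target_shape) != len(output_shape):
        return False
    return all(
        out_dim is None or out_dim == target_dim
        for target_dim, out_dim in zip(target_shape, output_shape)
    )


# The shapes from the exception above: only the fixed dimensions are compared.
print(shapes_compatible((1000, 15, 40, 40, 1), (None, None, 40, 40, 1)))
```

With a check along these lines, the target array from conv_lstm.py would be accepted, while a mismatch in one of the fixed dimensions (40, 40, 1) would still be rejected.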

kilotaras commented 7 years ago

@carlthome Sorry, my bad: I was checking the wrong parameter for None. I will prepare a PR with a fix (and a test) in a couple of minutes.

kilotaras commented 7 years ago

The fix is in #4458

abergeron commented 7 years ago

I tried with the current master of Theano, libgpuarray and keras (with the fix in #4458 on top) and I don't get an error.

carlthome commented 7 years ago

Curious. I just reinstalled Theano, Keras and libgpuarray from their respective master branches. All tests run with python -c "import pygpu; pygpu.test()" pass, but I still get the same error when running conv_lstm.py:

KeyError: ('The following error happened while compiling the node', GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0}), '\n', 'cudnn_handle')

nouiz commented 7 years ago

What is your GPU? Part of the error suggests the opposite, that it is old.

What is your GPU and cuda version? Maybe you need a more recent cuda/driver for your GPU.

On Mon, Nov 21, 2016 at 1:52 PM, Carl Thomé notifications@github.com wrote:

I think this might be related to my particular graphics card being very recent. If I force cuDNN with THEANO_FLAGS, warnings and errors pop up:

RuntimeError: You enabled cuDNN, but we aren't able to use it: Can not compile with cuDNN. We got this error:b"nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\n/usr/bin/ld: /tmp/tmpxft_00007474_00000000-4_try_flags_4_lxnpcb.o: relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC\n/usr/bin/ld: final link failed: Nonrepresentable section on output\ncollect2: error: ld returned 1 exit status\n"

carlthome commented 7 years ago

Solved!

I'm on Ubuntu 16.04. For CUDA 7.5, if you recall, the gcc version had to be no newer than 4.9, so many of us used update-alternatives to lock versions. Now, with CUDA 8, nvcc wants a newer gcc, so the solution was to configure update-alternatives properly.

One would most likely also have to clear Theano's compile cache (theano-cache purge).
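
In case it helps someone check for the same mismatch, here is a small sketch that reads the version of the gcc currently selected on the PATH and compares it to a limit. The function names are made up for illustration. The 4.9 limit for CUDA 7.5 is the one mentioned above; the right limit for other CUDA versions is not covered here and would need to come from NVIDIA's documentation.

```python
import subprocess


def parse_version(text):
    """Turn a version string such as '4.9.3' into a tuple of ints, e.g. (4, 9, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_no_newer_than(version, limit):
    """True if version does not exceed limit, comparing only as many parts as limit has."""
    return version[:len(limit)] <= limit


def installed_gcc_version(gcc="gcc"):
    """Ask the gcc on the PATH (the one update-alternatives selected) for its version."""
    output = subprocess.check_output([gcc, "-dumpversion"])
    return parse_version(output.decode())


# Example with fixed values, so it runs even where gcc is not installed.
print(is_no_newer_than(parse_version("4.9.3"), (4, 9)))
print(is_no_newer_than(parse_version("5.4.0"), (4, 9)))
```

Calling installed_gcc_version() before and after changing update-alternatives shows which compiler nvcc will actually pick up.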