keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

keras/examples/variational_autoencoder_deconv.py fails to run with Theano but working fine with Tensorflow #4259

Closed indraforyou closed 7 years ago

indraforyou commented 7 years ago

keras/examples/variational_autoencoder_deconv.py fails to run with Theano but working fine with Tensorflow.

I have the latest version; it also did not work on an older version.

In [2]: keras.__version__
Out[2]: '1.1.1'

In [4]: theano.__version__
Out[4]: '0.9.0dev1.dev-5e50147375ad507990655cc1a3e990aa4c190549'

On my PC, which has a Quadro GPU, I get this error:

`
Train on 60000 samples, validate on 10000 samples
Epoch 1/5
Traceback (most recent call last):
  File "variational_autoencoder_deconv.py", line 133, in
    validation_data=(x_test, x_test))
  File "/home/isur2/0.Work/DEEP_LEARNING/keras/keras/engine/training.py", line 1124, in fit
    callback_metrics=callback_metrics)
  File "/home/isur2/0.Work/DEEP_LEARNING/keras/keras/engine/training.py", line 842, in _fit_loop
    outs = f(ins_batch)
  File "/home/isur2/0.Work/DEEP_LEARNING/keras/keras/backend/theano_backend.py", line 792, in __call__
    return self.function(*inputs)
  File "/home/isur2/Dropbox (ASU)/DEEP_LEARNING/Theano/theano/compile/function_module.py", line 886, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/isur2/Dropbox (ASU)/DEEP_LEARNING/Theano/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/isur2/Dropbox (ASU)/DEEP_LEARNING/Theano/theano/compile/function_module.py", line 873, in __call__
    self.fn() if output_subset is None else\
RuntimeError: GpuDnnConvGradI: error getting worksize: CUDNN_STATUS_BAD_PARAM
Apply node that caused the error: GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
Toposort index: 332
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D), <theano.gof.type.CDataType object at 0x7f9459ff1390>, Scalar(float32), Scalar(float32)]
Inputs shapes: [(64, 64, 3, 3), (100, 64, 14, 14), (100, 64, 14, 64), 'No shapes', (), ()]
Inputs strides: [(576, 9, 3, 1), (12544, 196, 14, 1), (57344, 896, 64, 1), 'No strides', (), ()]
Inputs values: ['not shown', 'not shown', 'not shown', <capsule object NULL at 0x7f945079a540>, 1.0, 0.0]
Inputs name: ('kernel', 'grad', 'output', 'descriptor', 'alpha', 'beta')

Outputs clients: [[GpuDimShuffle{0,2,3,1}(GpuDnnConvGradI{algo='none', inplace=True}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
`
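The two HINT lines name Theano flags without showing how to set them. A minimal sketch, assuming the flags are passed through the `THEANO_FLAGS` environment variable as comma-separated `key=value` pairs; `merge_theano_flags` is a hypothetical helper written for this illustration, not part of Theano or Keras:

```python
import os


def merge_theano_flags(existing, **overrides):
    """Merge key=value overrides into a THEANO_FLAGS-style string.

    Hypothetical helper for illustration only.
    """
    flags = {}
    for item in existing.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            flags[key.strip()] = value.strip()
    # Overrides replace any existing value for the same key.
    flags.update(overrides)
    return ",".join(f"{key}={value}" for key, value in flags.items())


# This has to run before `import theano` / `import keras`,
# because Theano reads its configuration when it is imported.
os.environ["THEANO_FLAGS"] = merge_theano_flags(
    os.environ.get("THEANO_FLAGS", ""),
    optimizer="fast_compile",
    exception_verbosity="high",
)
```

The same flags can be set on the command line when launching the script, which avoids any dependence on import order.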

On our cluster, which has a Tesla GPU, I get this error:

`
Train on 60000 samples, validate on 10000 samples
Epoch 1/5
Traceback (most recent call last):
  File "variational_autoencoder_deconv.py", line 133, in
    validation_data=(x_test, x_test))
  File "/home/isur2/.python_packages/keras/engine/training.py", line 1124, in fit
    callback_metrics=callback_metrics)
  File "/home/isur2/.python_packages/keras/engine/training.py", line 842, in _fit_loop
    outs = f(ins_batch)
  File "/home/isur2/.python_packages/keras/backend/theano_backend.py", line 792, in __call__
    return self.function(*inputs)
  File "/home/isur2/.python_packages/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/isur2/.python_packages/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/isur2/.python_packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
ValueError: GpuCorrMM shape inconsistency:
  bottom shape: 100 64 29 64
  weight shape: 64 64 2 2
  top shape: 100 64 14 14 (expected 100 64 14 32)

Apply node that caused the error: GpuCorrMM_gradInputs{valid, (2, 2)}(GpuContiguous.0, GpuContiguous.0, TensorConstant{29}, TensorConstant{64})
Toposort index: 255
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(64, 64, 2, 2), (100, 64, 14, 14), (), ()]
Inputs strides: [(256, 4, 2, 1), (12544, 196, 14, 1), (), ()]
Inputs values: ['not shown', 'not shown', array(29), array(64)]
Outputs clients: [[GpuDimShuffle{0,2,3,1}(GpuCorrMM_gradInputs{valid, (2, 2)}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
`
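One possible reading of the `GpuCorrMM` message, offered as a guess rather than a confirmed diagnosis: a channels-last ("tf") output shape is being indexed as if it were channels-first ("th"), so the channel count 64 ends up where the width should be. The sketch below only redoes the arithmetic. The shape `(100, 29, 29, 64)` is an assumption about what the example passes in "tf" ordering; the numbers 29, 64, the 2x2 kernel and the stride of 2 come from the traceback.

```python
def valid_conv_output(rows, cols, kernel=2, stride=2):
    """Spatial output size of a 'valid' convolution with a square kernel."""
    return ((rows - kernel) // stride + 1, (cols - kernel) // stride + 1)


# Assumed deconvolution output shape in "tf" ordering: (batch, rows, cols, channels).
output_shape_tf = (100, 29, 29, 64)

# Assumption: the spatial size is taken from positions 2 and 3, which is only
# right for a "th"-ordered shape (batch, channels, rows, cols).
misread = valid_conv_output(output_shape_tf[2], output_shape_tf[3])

# What the "tf"-ordered shape actually means: rows and cols at positions 1 and 2.
intended = valid_conv_output(output_shape_tf[1], output_shape_tf[2])

print("misread :", misread)    # matches "expected 100 64 14 32"
print("intended:", intended)   # matches "top shape: 100 64 14 14"
```

Under this reading, a 29 x 64 "image" convolved with a 2x2 kernel at stride 2 gives 14 x 32, while the gradient actually supplied is 14 x 14, which is exactly the inconsistency reported.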

TensorFlow is not quite working on our cluster, so I need to fall back on Theano, which has this problem. Note: the keras/examples/variational_autoencoder.py example works fine with Theano, which suggests the problem is in the convolution or deconvolution layers.

Regards, Indranil

indraforyou commented 7 years ago

Update:

My Keras config had "image_dim_ordering": "tf".

After changing this to "image_dim_ordering": "th", the example works fine (on both the Quadro and Tesla GPUs).
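For anyone who wants to script that change, here is a minimal sketch. It assumes the config is the JSON file Keras 1.x reads from `~/.keras/keras.json`; `set_image_dim_ordering` is a hypothetical helper written for this illustration, and the demo runs against a temporary file rather than the real config.

```python
import json
import tempfile
from pathlib import Path


def set_image_dim_ordering(config_path, ordering):
    """Set "image_dim_ordering" in a Keras-style JSON config, keeping other keys.

    Hypothetical helper for illustration only.
    """
    if ordering not in ("th", "tf"):
        raise ValueError("ordering must be 'th' or 'tf'")
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config["image_dim_ordering"] = ordering
    path.write_text(json.dumps(config, indent=4))
    return config


# Demo on a throwaway file so the real config is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "keras.json"
    demo_path.write_text(json.dumps({"backend": "theano", "image_dim_ordering": "tf"}))
    updated = set_image_dim_ordering(demo_path, "th")

# For the real file (assumed default location), the call would be:
# set_image_dim_ordering(Path.home() / ".keras" / "keras.json", "th")
```

Keras reads this file at import time, so the change takes effect only in a fresh Python process.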

But the same can't be said for my own modified code:

- On the Tesla GPU it works fine.
- On the Quadro GPU it gives a similar error:

  File "/home/isur2/0.Work/DEEP_LEARNING/keras/keras/engine/training.py", line 1124, in fit
    callback_metrics=callback_metrics)
  File "/home/isur2/0.Work/DEEP_LEARNING/keras/keras/engine/training.py", line 842, in _fit_loop
    outs = f(ins_batch)
  File "/home/isur2/0.Work/DEEP_LEARNING/keras/keras/backend/theano_backend.py", line 792, in __call__
    return self.function(*inputs)
  File "/home/isur2/Dropbox (ASU)/DEEP_LEARNING/Theano/theano/compile/function_module.py", line 886, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/isur2/Dropbox (ASU)/DEEP_LEARNING/Theano/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/isur2/Dropbox (ASU)/DEEP_LEARNING/Theano/theano/compile/function_module.py", line 873, in __call__
    self.fn() if output_subset is None else\
**ValueError: GpuDnnConv images and kernel must have the same stack size**

In any case, with "image_dim_ordering": "tf" in the config, convolution layers do not work with Theano. With TensorFlow, both settings seem to work.

Regards, Indranil

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.