RuntimeError: Cuda error: k_copy_4d: invalid device function.

Hi, I am trying to train the net, but I get the error shown below. I am using CUDA 7.5 on Debian with a Tesla K40 GPU.
$/majid/work/retina-unet$ python run_training.py
1. Create directory for the results (if not already existing)
Dir already existing
copy the configuration file in the results folder

2. Run the training on GPU (no nohup)
Using Theano backend.
Using gpu device 0: Tesla K40m (CNMeM is disabled, cuDNN not available)

train images/masks shape:
(20, 1, 565, 565)
train images range (min-max): 0.0 - 1.0
train masks are within 0-1

patches per full image: 9500

train PATCHES images/masks shape:
(190000, 1, 48, 48)
train PATCHES images range (min-max): 0.0 - 1.0
Traceback (most recent call last):
  File "./src/retinaNN_training.py", line 176, in <module>
    model = get_unet(n_ch, patch_height, patch_width)  #the U-net model
  File "./src/retinaNN_training.py", line 34, in get_unet
    conv1 = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(inputs)
  File "/home/azimi/.local/lib/python2.7/site-packages/keras/engine/topology.py", line 569, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/home/azimi/.local/lib/python2.7/site-packages/keras/engine/topology.py", line 632, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/home/azimi/.local/lib/python2.7/site-packages/keras/engine/topology.py", line 164, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/home/azimi/.local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 445, in call
    filter_shape=self.W_shape)
  File "/home/azimi/.local/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 1482, in conv2d
    np_kernel = kernel.eval()
  File "/home/azimi/.local/lib/python2.7/site-packages/theano/gof/graph.py", line 519, in eval
    rval = self._fn_cache[inputs](*args)
  File "/home/azimi/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 886, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/azimi/.local/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/azimi/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 873, in __call__
    self.fn() if output_subset is None else\
RuntimeError: Cuda error: k_copy_4d: invalid device function.
Apply node that caused the error: HostFromGpu(GpuDimShuffle{3,2,0,1}.0)
Toposort index: 1
Inputs types: [CudaNdarrayType(float32, 4D)]
Inputs shapes: [(32, 48, 3, 3)]
Inputs strides: [(1, 32, 4608, 1536)]
Inputs values: ['not shown']
Outputs clients: [['output']]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
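
The two hints above refer to Theano configuration flags. A minimal sketch of how they could be enabled from Python follows, assuming the flags are passed through the `THEANO_FLAGS` environment variable and that this code runs before `theano` (or Keras) is first imported. The helper name `theano_flags` is hypothetical and only used for illustration.

```python
import os


def theano_flags(**flags):
    """Join keyword arguments into a THEANO_FLAGS string such as 'a=1,b=2'.

    Hypothetical helper for illustration; it is not part of Theano.
    """
    return ",".join("{}={}".format(k, v) for k, v in sorted(flags.items()))


# The two flags named in the hints of the error message.
debug_flags = theano_flags(optimizer="fast_compile", exception_verbosity="high")

# Keep any flags that are already set (for example a device setting)
# and append the debugging flags to them.
existing = os.environ.get("THEANO_FLAGS", "")
os.environ["THEANO_FLAGS"] = ",".join(p for p in (existing, debug_flags) if p)

# Theano reads THEANO_FLAGS when it is imported, so the training code
# should only be imported after this point.
print(os.environ["THEANO_FLAGS"])
```

The same effect can probably be had without touching the code by prefixing the command, for example `THEANO_FLAGS='optimizer=fast_compile,exception_verbosity=high' python run_training.py`. These flags only make the error report more detailed; they are not expected to fix the `invalid device function` error by themselves.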
orobix / retina-unet, issue #15