CUDA error in DAG training

arunmallya commented 8 years ago

Similar to https://github.com/vlfeat/matconvnet/issues/325, but in the vl_nnpool layer:

Error using vl_nnpool
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED

Error in dagnn.Pooling/backward (line 18)
      derInputs{1} = vl_nnpool(inputs{1}, self.poolSize, derOutputs{1}, ...

Error in dagnn.Layer/backwardAdvanced (line 119)
      [derInputs, derParams] = obj.backward ...

Error in dagnn.DagNN/eval (line 99)
  obj.layers(l).block.backwardAdvanced(obj.layers(l)) ;

Error in cnn_train_dag>process_epoch (line 186)
      net.eval(inputs, opts.derOutputs) ;

Error in cnn_train_dag (line 84)
    stats.train(epoch) = process_epoch(net, state, opts, 'train') ;

This error occurs only after a fixed number of iterations: always at iteration 30, irrespective of the random seed used, which is very strange.

train: epoch 01:  28/1718: 2.8 Hz loss: 62.145
train: epoch 01:  29/1718: 2.8 Hz loss: 60.642
train: epoch 01:  30/1718: 2.8 Hz loss: 59.364
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED

I examined the inputs to the network at this iteration and they look fine, so the error is unlikely to be caused by corrupted or invalid data. GPU memory usage hovers around 5.5 GB of 11.5 GB on a K40 with CUDA 6.5.
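
For reference, a check of this kind could look like the minimal sketch below. It assumes the batch is the {'name', value, ...} cell array passed to net.eval; the helper name isBatchFinite and the sample data are hypothetical, and the example uses CPU arrays (GPU arrays would need to be gathered first).

```matlab
% Hypothetical helper: true if every numeric entry of the batch is finite.
% Non-numeric entries (the variable names) are skipped.
isBatchFinite = @(inputs) all(cellfun( ...
    @(v) ~isnumeric(v) || all(isfinite(v(:))), inputs)) ;

% Sample batch in the {'name', value, ...} layout (illustrative data only).
inputs = {'input', zeros(4, 4, 3, 2, 'single'), 'label', single([1 2])} ;
ok = isBatchFinite(inputs) ;

% A batch containing NaN or Inf fails the check.
bad = {'input', single([1 NaN 3])} ;
notOk = isBatchFinite(bad) ;
```

A check like this only rules out NaN/Inf values in the data; it says nothing about out-of-range labels or about the kernel launch itself.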

Any clues as to what's going wrong?

vedaldi commented 8 years ago

Hi, I would be very interested in debugging this error.

Are you using cuDNN by any chance?
Could we know the exact training setting and data to try to reproduce it?

Andrea

arunmallya commented 8 years ago

No, I'm not using cuDNN.

I can share the model I am using, but the exact training setting might be hard to reproduce since I just load inputs from images in a folder (~7 GB in size). I will try to reproduce the issue with smaller data and send that over if possible.

vedaldi commented 8 years ago

Hi, thanks. You might get away with just generating matrices of zeros in getBatch (so there is no need to share the images).
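
A minimal sketch of such a getBatch, under stated assumptions: the input size, the variable names 'input' and 'label', and the label layout are placeholders that must be matched to the actual network, and the arrays would need to be wrapped in gpuArray for GPU training.

```matlab
% Assumed input size for illustration; replace with the network's real size.
imageSize = [224 224 3] ;

% Ignores imdb and returns an all-zero batch in the {'name', value, ...}
% layout consumed by net.eval. 'input' and 'label' are assumed variable names.
getBatch = @(imdb, batch) { ...
    'input', zeros([imageSize numel(batch)], 'single'), ...
    'label', ones(1, numel(batch), 'single')} ;

% Example: a batch for images 1..8.
inputs = getBatch([], 1:8) ;
```

If the failure still appears with this batch function, the image data can be ruled out as the cause and the problem can be reproduced without sharing the dataset.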

arunmallya commented 8 years ago

You're absolutely right: using zero matrices also causes the error at iteration 30!

Here's the code: https://gist.github.com/arunmallya/c7b6c6cafa6252172727
And here's the imdb: https://www.dropbox.com/s/xapokyw8iinbcyl/imdbHICO.mat?dl=0

I really hope it isn't some silly mistake on my part causing the error :)

vlfeat / matconvnet

CUDA error in DAG training #332