vlfeat / matconvnet

MatConvNet: CNNs for MATLAB

CUDA error in DAG training #332

Open arunmallya opened 8 years ago

arunmallya commented 8 years ago

Similar to https://github.com/vlfeat/matconvnet/issues/325, but in vl_nnpool layer

Error using vl_nnpool
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED

Error in dagnn.Pooling/backward (line 18)
      derInputs{1} = vl_nnpool(inputs{1}, self.poolSize, derOutputs{1}, ...

Error in dagnn.Layer/backwardAdvanced (line 119)
      [derInputs, derParams] = obj.backward ...

Error in dagnn.DagNN/eval (line 99)
  obj.layers(l).block.backwardAdvanced(obj.layers(l)) ;

Error in cnn_train_dag>process_epoch (line 186)
      net.eval(inputs, opts.derOutputs) ;

Error in cnn_train_dag (line 84)
    stats.train(epoch) = process_epoch(net, state, opts, 'train') ;

This error occurs only after a certain number of iterations: always at iteration 30, irrespective of the random seed used, which is strange and interesting.

train: epoch 01:  28/1718: 2.8 Hz loss: 62.145
train: epoch 01:  29/1718: 2.8 Hz loss: 60.642
train: epoch 01:  30/1718: 2.8 Hz loss: 59.364
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED 

I examined the inputs to the network at this iteration and they seem perfectly fine, so it's highly unlikely that the error is due to corrupted or invalid data. GPU memory usage hovers around 5.5 GB of 11.5 GB on a K40 with CUDA 6.5.

Any clues as to what's going wrong?

vedaldi commented 8 years ago

Hi, I would be very interested in debugging this error.

Andrea

arunmallya commented 8 years ago

No, I'm not using cuDNN

I can share the model I am using but the exact training setting might be hard to reproduce since I just load inputs from images in a folder (~7GB in size). I will try to reproduce the issue with smaller data and send that over if possible.

vedaldi commented 8 years ago

Hi, thanks. You might get away with just generating matrices of zeros in getBatch (hence without any need to share images).
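
For readers who want to try the zero-input suggestion, here is a minimal sketch of what such a getBatch replacement could look like. The function name `getBatchZeros`, the `opts` fields, and the variable names `'input'` and `'label'` are all hypothetical; they are not taken from the reporter's code and would need to match the actual network.

```matlab
function inputs = getBatchZeros(~, batch, opts)
% Hypothetical stand-in for getBatch: returns all-zero tensors so the
% failure can be reproduced without access to the original images.
%   opts.imageSize  : [H W C] row vector expected by the first layer (assumed)
%   opts.numClasses : number of label channels (assumed)
%   opts.useGpu     : if true, move the tensors to the GPU
n = numel(batch) ;                                    % batch size
im = zeros([opts.imageSize n], 'single') ;            % H x W x C x N
labels = zeros(1, 1, opts.numClasses, n, 'single') ;  % 1 x 1 x K x N
if opts.useGpu
  im = gpuArray(im) ;
  labels = gpuArray(labels) ;
end
% 'input' and 'label' are assumed variable names; they must match the
% input variables of the DagNN being trained.
inputs = {'input', im, 'label', labels} ;
end
```

Assuming the training script accepts a two-argument batch function, it could be wrapped as `@(imdb, batch) getBatchZeros(imdb, batch, bopts)`, where `bopts` is a struct holding the fields above. If the crash still appears at the same iteration with these inputs, the image data itself can be ruled out as the cause.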

arunmallya commented 8 years ago

You're absolutely right: using zero matrices also causes the error at iteration 30!
Here's the code: https://gist.github.com/arunmallya/c7b6c6cafa6252172727
And here's the imdb: https://www.dropbox.com/s/xapokyw8iinbcyl/imdbHICO.mat?dl=0

I really hope it isn't some silly mistake on my part causing the error :)