Open arunmallya opened 8 years ago
Hi, I would be very interested in debugging this error.
Andrea
On 27 Nov 2015, at 22:17, Arun Mallya notifications@github.com wrote:
Similar to #325 https://github.com/vlfeat/matconvnet/issues/325, but in vl_nnpool layer
Error using vl_nnpool
An unexpected error occurred during CUDA execution. The CUDA error was: CUDA_ERROR_LAUNCH_FAILED

Error in dagnn.Pooling/backward (line 18)
derInputs{1} = vl_nnpool(inputs{1}, self.poolSize, derOutputs{1}, ...

Error in dagnn.Layer/backwardAdvanced (line 119)
[derInputs, derParams] = obj.backward ...

Error in dagnn.DagNN/eval (line 99)
obj.layers(l).block.backwardAdvanced(obj.layers(l)) ;

Error in cnn_train_dag>process_epoch (line 186)
net.eval(inputs, opts.derOutputs) ;

Error in cnn_train_dag (line 84)
stats.train(epoch) = process_epoch(net, state, opts, 'train') ;

This error occurs only after a certain number of iterations, always at iteration 30 irrespective of random seed used - this is highly strange and interesting.

train: epoch 01: 28/1718: 2.8 Hz loss: 62.145
train: epoch 01: 29/1718: 2.8 Hz loss: 60.642
train: epoch 01: 30/1718: 2.8 Hz loss: 59.364
Warning: An unexpected error occurred during CUDA execution. The CUDA error was: CUDA_ERROR_LAUNCH_FAILED

I examined the inputs to the network at this iteration and they seem perfectly fine, thus, it's highly unlikely that the error is due to data corruption/invalid data. The GPU usage hovers around 5.5GB/ 11.5GB on a K-40 with cuda 6.5.
Any clues as to what's going wrong?
No, I'm not using cuDNN
I can share the model I am using but the exact training setting might be hard to reproduce since I just load inputs from images in a folder (~7GB in size). I will try to reproduce the issue with smaller data and send that over if possible.
Hi, thanks. You might get away with just generating matrices of zeros in getBatch (so there is no need to share the images).
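As a rough illustration of that suggestion, a zero-data replacement for getBatch might look like the sketch below. This is not the code used in this thread: the function name getBatchZeros, the useGpu argument, the 224x224x3 image size, the class-index label layout, and the variable names 'input' and 'label' are all assumptions that would need to match the actual network.

```matlab
function inputs = getBatchZeros(imdb, batch, useGpu)
% GETBATCHZEROS  Hypothetical stand-in for getBatch that never reads images.
%   Returns all-zero image data so that a crash can be reproduced without
%   sharing the dataset. imdb is accepted only to keep the usual
%   getBatch(imdb, batch) calling pattern; it is not used.

imageSize = [224 224 3] ;            % assumed network input size (H x W x C)
n = numel(batch) ;                   % batch holds the requested sample indices

im = zeros([imageSize n], 'single') ;  % H x W x C x N array of zeros
labels = ones(1, n, 'single') ;        % assumed layout: one class index per sample

if useGpu
  im = gpuArray(im) ;                % needs the Parallel Computing Toolbox
end

% Name/value pairs in the form passed on to net.eval; the variable
% names are assumptions and must match those defined in the network.
inputs = {'input', im, 'label', labels} ;
end
```

It could be handed to the training script through a closure such as `@(imdb, batch) getBatchZeros(imdb, batch, true)`, assuming the script calls getBatch with those two arguments.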
You're absolutely right, using zero matrices also causes the error on iter 30! Here's the code: https://gist.github.com/arunmallya/c7b6c6cafa6252172727 And here's the imdb: https://www.dropbox.com/s/xapokyw8iinbcyl/imdbHICO.mat?dl=0
I really hope it isn't some silly mistake on my part causing the error :)