All these examples work fine on my GT650M on OSX. I guess this is a 750M issue. Try lowering the epsilon value for the GPU in optimizers.py and tell me what you get.
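For clarity, the equivalent of that change without editing optimizers.py is to pass an explicit epsilon to the optimizer constructor; a minimal sketch, assuming the Keras 0.x API where epsilon is a constructor argument and assuming a `model` object is already defined:

    from keras.optimizers import RMSprop

    # epsilon is the value under discussion; the Keras 0.x default is 1e-6.
    # 'model' is assumed to be an already-constructed Sequential model.
    rmsprop = RMSprop(lr=0.001, rho=0.9, epsilon=1e-4)
    model.compile(loss='categorical_crossentropy', optimizer=rmsprop)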
I tried changing epsilon to several values ranging from 1e-6 to 1e-20 in optimizers.py and the bug remains the same.
About it being a 750M issue, I doubt it because, as I said, these sorts of operations work fine directly in Theano, with Lasagne, or even with a handmade library. The accuracy problem only arises when using Keras.
It's a float32 overflow, and it happens on the 750M but not the 650M for the same code. Surely you can understand this.
I realize I wasn't clear. I meant the epsilon in objectives.py, and I meant lowering the exponent. Try 10e-5 and 10e-4. You can also try changing the epsilon value in the optimizer you are using, by lowering the exponent, but I believe the one value that is causing the overflow would be the one in objectives.py.
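For context, the epsilon in objectives.py is the clipping constant used by the cross-entropy objective; a rough sketch of the general pattern (not the exact Keras source), assuming the Theano backend:

    import theano.tensor as T

    epsilon = 1e-4  # the value being discussed; the default is much smaller

    def categorical_crossentropy(y_true, y_pred):
        # Clip predictions away from exact 0 and 1 so the log never sees a zero,
        # which would produce -inf/NaN in float32.
        y_pred = T.clip(y_pred, epsilon, 1.0 - epsilon)
        # Renormalize so the clipped rows still sum to 1.
        y_pred /= y_pred.sum(axis=-1, keepdims=True)
        return T.nnet.categorical_crossentropy(y_pred, y_true)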
Tried changing epsilon to 10e-5 and 10e-4 to no avail. I also constructed a model using SGD, which doesn't use epsilon, and it also only works in float64.
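For the record, a minimal sketch of the kind of setup described, assuming the Keras 0.x API (the actual architecture isn't shown in the thread, and `model` is assumed to be already built):

    from keras.optimizers import SGD

    # Plain SGD has no epsilon term at all, so a float32 overflow in this
    # configuration cannot come from an optimizer epsilon.
    sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
    model.compile(loss='categorical_crossentropy', optimizer=sgd)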
I have the same issue whenever I try to train GRU or LSTM with categorical_crossentropy and adam, rmsprop or SGD. The network will converge just fine for a while and then suddenly go to NaN.
I've also tried fiddling with epsilon, to no avail. At this point, Keras is effectively unusable for me because of it :(.
Just as a clarification, I found out that float64 actually runs on the CPU even if you configure Theano to use the GPU. So this is an issue arising when you use the GPU with Keras.
All GPUs only perform float32 operations. Well, I guess you could perform virtual float64 operations emulated via hardware float32 operations, but that's not something Theano would do.
The interesting thing here is that the issue is specific to certain GPUs, which means the cause of the float32 overflow is unlikely to be found in the Keras code (if the Keras code was at fault, the overflow would happen every time the code is run in float32, i.e. on every GPU).
But on the other hand, if the Keras code was not at fault, the issue would happen every time you tried something similar with any library that uses Theano. Yet I've tried similar runs in pure Theano and in Lasagne (another library that uses Theano as the backend, like Keras does) and it runs just fine. So it must be something Keras does that it's not supposed to do, even if it works fine with other cards.
Could you please try NanGuardMode and tell us what you get?
Where can I change that setting?
I've tried NanGuardMode = True in .theanorc and the results remain the same.
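For reference, a minimal sketch of the per-function way to enable it, assuming Theano's documented NanGuardMode API (as far as I know, NanGuardMode = True is not a recognized .theanorc option; whether the mode can also be selected globally via the mode flag depends on the Theano version):

    import theano
    import theano.tensor as T
    from theano.compile.nanguardmode import NanGuardMode

    x = T.matrix('x')
    y = T.log(x)  # any graph; a log of zero would trip the guard
    f = theano.function([x], y,
                        mode=NanGuardMode(nan_is_error=True,
                                          inf_is_error=True,
                                          big_is_error=True))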
I ran into this problem as well. I had to upgrade scipy to 0.16.0 to fix it (if you are using gensim, that might be problematic though).
I have scipy 0.16.0 installed and the problem remains.
As additional information, a user in the Keras mailing list thread about this problem says he also has it, and that he uses a Titan X.
This is most likely due to a specific feature of Theano being used by Keras which doesn't play well with this specific GPU (Titan X). You can still run Keras on any other GPU, or use other Theano features on the Titan X.
It would be interesting to determine which feature is at fault, specifically. Then we can open an issue with the Theano devs.
In any case, since this problem is GPU-specific and since Keras does not have any GPU-specific code (it simply calls Theano as its computation backend), the issue definitely lies with Theano.
@morgado-developer I have had the same issue with Keras code. I replaced the get_output() function of the Convolution2D class with the code below and it solved the issue. If you are using cuDNN, make sure to remove it completely (by deleting all the *.so files and the cudnn.h file from the CUDA installation), because Theano uses cuDNN by default if it is present, and set optimizer_including=conv_gemm in the THEANO_FLAGS. Please let me know if the fix works for you -
# Drop-in replacement for Convolution2D.get_output in keras/layers/convolutional.py.
# Needs at module level:
#     from theano.sandbox.cuda.basic_ops import gpu_contiguous
#     import theano.sandbox.cuda.blas
def get_output(self, train):
    X = self.get_input(train)
    border_mode = self.border_mode
    if border_mode == 'same':
        # GpuCorrMM has no 'same' mode: run a 'full' convolution, then crop below.
        border_mode = 'full'
    X = gpu_contiguous(X)  # GpuCorrMM requires a contiguous input
    conv_out = theano.sandbox.cuda.blas.GpuCorrMM(
        border_mode=border_mode,
        subsample=self.subsample)(X, self.W)
    # Replaces the original call:
    # conv_out = theano.tensor.nnet.conv.conv2d(X, self.W,
    #     border_mode=border_mode, subsample=self.subsample)
    if self.border_mode == 'same':
        # Crop the 'full' output back to the 'same' output shape.
        shift_x = (self.nb_row - 1) // 2
        shift_y = (self.nb_col - 1) // 2
        conv_out = conv_out[:, :, shift_x:X.shape[2] + shift_x,
                            shift_y:X.shape[3] + shift_y]
    return self.activation(conv_out + self.b.dimshuffle('x', 0, 'x', 'x'))
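With that patch applied, a run would look something like this (a sketch; the script name and the remaining flags are taken from commands quoted elsewhere in this thread, not from this comment): THEANO_FLAGS=device=gpu,floatX=float32,optimizer_including=conv_gemm python examples/mnist_cnn.py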
Just setting optimizer_excluding=cudnn would also work, but it seems like a major setback if cuDNN cannot be used.
This recent change might have helped, check it out: https://github.com/fchollet/keras/blob/master/keras/layers/convolutional.py#L152-L164
I had this problem and solved it by getting the latest Theano from github. I am using cuDNN v3 with a Titan X on Win8.
great, upgrading to the github theano solved the problem
So, everyone confirms that this issue is definitely solved with the latest Theano and latest Keras?
I had gotten bleeding-edge Keras and Theano 2 weeks ago to no avail; I haven't tried since.
I am using keras 2c30d503eada5cb5429b6f6d8ced0e996760e40e (github on august 27th) and theano 0.7.0.dev-856aa0b6d3454ff1b4d00575e1ec38f27aedb7d9.
Just tried with keras head 332d43e023073561fec53828ee21e206ac1b34b1 and theano head '0.7.0.dev-dc13bfcaa165b0d2d24ec509944da9f29114470b': running python3.4 mnist_cnn.py on a Tesla K40m prints a NaN loss. When using the CPU configuration it works fine. Just for fun, I also tried with python2.7 (instead of python3.4), and there I get the same NaN behaviour.
Just adding some more evidence for this.
I am using an iMac with a GTX 980M and was seeing the same problem. I was running examples/addition_rnn.py and it never converged on the GPU but did on the CPU.
Running with optimizer_excluding=cudnn fixes the issue. E.g. THEANO_FLAGS=device=gpu,floatX=float32,cuda.root=/usr/local/cuda,optimizer_excluding=cudnn python examples/addition_rnn.py
I also updated to the bleeding edge theano as suggested on this thread: pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
And this now works fine. I.e. the original failing version now works correctly, as reported by others on this thread. THEANO_FLAGS=device=gpu,floatX=float32,cuda.root=/usr/local/cuda python examples/addition_rnn.py
Excellent. We'll consider this resolved then.
I believe the NaN loss on GPU reported by @rodrigob is a different issue (which we also encountered in the past). We'll resolve it separately.
Hi all, just ran into this bug (or a very similar bug) running 0.1.3, updating to the current master (c18a9cd405f29040ebb259aef74963c4b2134494) has fixed this problem for me.
Hi all. I fixed the problem by replacing the ReLU layers with Softplus. Maybe you can try it. I use the bleeding-edge version of Theano and the latest version of Keras.
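A minimal sketch of that kind of swap, assuming the Keras 0.x Sequential API (the layer sizes here are arbitrary placeholders):

    from keras.models import Sequential
    from keras.layers.core import Dense, Activation

    model = Sequential()
    model.add(Dense(784, 128))
    model.add(Activation('softplus'))  # instead of Activation('relu')
    model.add(Dense(128, 10))
    model.add(Activation('softmax'))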
The issue still remains, even after disabling the cuDNN optimisation:
Code: mnist_cnn.py with SGD. Graphics: GTX 980 Ti.
.theanorc:
[global]
floatX=float32
optimizer_excluding=cudnn

[lib]
cnmem=0.7

[nvcc]
fastmath=True

Using bleeding-edge Theano and Keras, as well as Keras at commit 20dc637.
I've been trying all the MNIST examples from the Keras documentation as well as the cifar10 and they just don't work if I use the GPU.
The Accuracy is always below 0.1 or NaN.
If I do it on the CPU it works correctly, and if I do it on the GPU with float64 it also works correctly, although slowly.
I've tried restarting the laptop (like another user suggests) and also cleaning and purging the Theano cache, but the problem remains.
My system is a Mac OS X 10.10. With a GTX 750M.
Theano without Keras works properly on the GPU in float32 with other libraries/examples.