keras-team / keras

Deep Learning for humans
http://keras.io/

Very low or NaN accuracy in the examples when running on a GPU with float32 #511

Closed jbmorgado closed 9 years ago

jbmorgado commented 9 years ago

I've been trying all the MNIST examples from the Keras documentation, as well as the CIFAR-10 one, and they just don't work if I use the GPU.

The accuracy is always below 0.1 or NaN.

If I do it on the CPU it works correctly, and if I do it on the GPU with float64 it also works correctly, although slowly.

I've tried restarting the laptop (as another user suggests) and also clearing and purging the Theano cache, but the problem remains.

My system is Mac OS X 10.10 with a GTX 750M.

Theano without Keras works properly on the GPU with float32, with other libraries/examples.

fchollet commented 9 years ago

All these examples work fine on my GT650M on OSX. I guess this is a 750M issue.

Try lowering the epsilon value for the GPU in optimizers.py and tell me what you get.
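
(For context, a minimal sketch, not the actual Keras source, of where such an epsilon enters an RMSprop-style update and why its value matters in float32; all names here are illustrative:)

```python
import numpy as np

def rmsprop_step(w, grad, acc, lr=0.001, rho=0.9, epsilon=1e-6):
    """One illustrative RMSprop-style update in float32.

    If epsilon is far below float32 precision, sqrt(acc) + epsilon is
    indistinguishable from sqrt(acc); should acc underflow to zero, the
    division yields inf, and the weights become NaN from there on.
    """
    acc = rho * acc + (1.0 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(acc) + epsilon)
    return w.astype(np.float32), acc.astype(np.float32)
```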

jbmorgado commented 9 years ago

I tried changing epsilon to several values ranging from 1e-6 to 1e-20 in optimizers.py, and the bug remains the same.

About it being a 750M issue, I doubt it because, as I said, these sorts of operations work fine directly in Theano, with Lasagne, or even with a handmade library. The accuracy problem only arises when using Keras.

fchollet commented 9 years ago

About it being a 750M issue, I doubt it because, as I said, these sorts of operations work fine directly in Theano, with Lasagne, or even with a handmade library. The accuracy problem only arises when using Keras.

It's a float32 overflow, and it happens on the 750M but not the 650M for the same code. Surely you can understand this.

from 1e-6 to 1e-20 in optimizers.py and the bug remains the same.

I realize I wasn't clear. I meant the epsilon in objectives.py, and I meant lowering the exponent. Try 10e-5 and 10e-4. You can also try changing the epsilon value in the optimizer you are using, by lowering the exponent, but I believe the one value that is causing the overflow would be the one in objectives.py.
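
(To illustrate why that epsilon exists in the objective: without clipping, a predicted probability of exactly 0 sends log() to -inf and the loss to NaN. A minimal numpy sketch, not the actual Keras source:)

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, epsilon=1e-7):
    # Clip predictions away from 0 and 1 so log() stays finite.
    y_pred = np.clip(y_pred, epsilon, 1.0 - epsilon)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 1.0]], dtype=np.float32)
y_pred = np.array([[1.0, 0.0]], dtype=np.float32)  # pathological prediction
print(categorical_crossentropy(y_true, y_pred))    # large but finite
```

A larger epsilon clips more aggressively, which trades a little bias in the loss for numerical headroom in float32.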

jbmorgado commented 9 years ago

Tried changing epsilon to 10e-5 and 10e-4, to no avail. I also constructed a model using plain SGD, which doesn't use epsilon, and it also only works in float64.

w0nk0 commented 9 years ago

I have the same issue whenever I try to train a GRU or LSTM with categorical_crossentropy and adam, rmsprop, or SGD. The network converges just fine for a while and then suddenly goes to NaN.

I've also tried fiddling with epsilon, to no avail. At this point, Keras is effectively unusable for me because of it :(.

jbmorgado commented 9 years ago

Just as a clarification: I found out that float64 actually uses the CPU even if you configure Theano to use the GPU. So this is an issue that arises when you use the GPU with Keras.

fchollet commented 9 years ago

All GPUs only perform float32 operations. Well I guess you could perform virtual float64 operations emulated via hardware float32 operations, but that's not something Theano would do.
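
(As a quick illustration of how much narrower the float32 range is, independent of Theano or any particular GPU:)

```python
import numpy as np

print(np.finfo(np.float32).max)  # ~3.40e+38
print(np.finfo(np.float64).max)  # ~1.80e+308
print(np.exp(np.float32(89.0)))  # inf: e^89 ~ 4.5e38 overflows float32
print(np.exp(np.float64(89.0)))  # finite in float64
```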

fchollet commented 9 years ago

The interesting thing here is that the issue is specific to certain GPUs, which means the cause of the float32 overflow is unlikely to be found in the Keras code (if the Keras code were at fault, the overflow would happen every time the code is run in float32, i.e. on every GPU).

jbmorgado commented 9 years ago

But on the other hand, if the Keras code were not at fault, the issue would happen every time you tried something similar with any library that uses Theano. Yet I've tried similar runs in pure Theano and in Lasagne (another library that, like Keras, uses Theano as its backend), and they run just fine. So it must be something that Keras does that it's not supposed to do, even if it works fine with other cards.

pcyin commented 9 years ago

Could you please try NanGuardMode and tell us what you get?
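
(For anyone unfamiliar with it: NanGuardMode is a Theano compilation mode that raises an error as soon as a NaN, inf, or abnormally large value flows through the graph. A minimal sketch of enabling it when compiling a Theano function yourself; the toy graph is illustrative:)

```python
import theano
import theano.tensor as T
from theano.compile.nanguardmode import NanGuardMode

# Toy graph; NanGuardMode checks every intermediate value at runtime.
x = T.matrix('x')
f = theano.function(
    [x], T.nnet.softmax(x),
    mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True),
)
```

Since Keras compiles its functions internally, the practical route is the global setting, i.e. mode=NanGuardMode under [global] in .theanorc (assuming your Theano version registers it as a named mode) rather than a NanGuardMode = True flag.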

jbmorgado commented 9 years ago

Where can I change that setting? I've tried NanGuardMode = True in .theanorc and the results remain the same.

ddsh commented 9 years ago

I ran into this problem as well. I had to upgrade scipy to 0.16.0 to fix it (if you are using gensim, that might be problematic though).

jbmorgado commented 9 years ago

I have scipy 0.16.0 installed and the problem remains.

As an additional piece of information: a user in the Keras mailing-list thread about this problem says he also has it, and he uses a Titan X.

fchollet commented 9 years ago

But on the other hand, if the Keras code were not at fault, the issue would happen every time you tried something similar with any library that uses Theano.

This is most likely due to a specific Theano feature used by Keras that doesn't play well with this specific GPU (Titan X). You can still run Keras on any other GPU, or use other Theano features on the Titan X.

It would be interesting to determine which feature is at fault, specifically. Then we can open an issue with the Theano devs.

In any case, since this problem is GPU-specific and since Keras does not have any GPU-specific code (it simply calls Theano as its computation backend), the issue definitely lies with Theano.

bhokaal2k commented 9 years ago

@morgado-developer I have had the same issue with the Keras code. I replaced the get_output() function of the Convolution2D class with the code below, and it solved the issue. If you are using cuDNN, make sure to remove it completely (by deleting all the *.so files and the cudnn.h file from the CUDA installation), because Theano uses cuDNN by default if present, and set optimizer_including=conv_gemm in the THEANO_FLAGS. Please let me know if the fix works for you:

```python
# Imports the original snippet relied on but did not show:
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from theano.sandbox.cuda.blas import GpuCorrMM

def get_output(self, train):
    X = self.get_input(train)
    border_mode = self.border_mode
    if border_mode == 'same':
        border_mode = 'full'
    # Force a contiguous layout, then convolve with the GEMM-based op
    # instead of the default (possibly cuDNN-backed) conv2d.
    X = gpu_contiguous(X)
    conv_out = GpuCorrMM(border_mode=border_mode,
                         subsample=self.subsample)(X, self.W)

    if self.border_mode == 'same':
        # Crop the 'full' convolution output back to 'same' dimensions.
        shift_x = (self.nb_row - 1) // 2
        shift_y = (self.nb_col - 1) // 2
        conv_out = conv_out[:, :, shift_x:X.shape[2] + shift_x,
                            shift_y:X.shape[3] + shift_y]

    return self.activation(conv_out + self.b.dimshuffle('x', 0, 'x', 'x'))
```
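
(For context: GpuCorrMM is Theano's GEMM-based convolution, ported from Caffe's im2col-plus-matrix-multiply approach, so this replacement effectively forces that code path and bypasses cuDNN entirely, which is consistent with the optimizer_excluding=cudnn workaround mentioned below.)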
flyboys3000 commented 9 years ago

Just setting optimizer_excluding=cudnn would also work, but it seems like a major setback if cuDNN cannot be used.

fchollet commented 9 years ago

Just setting optimizer_excluding=cudnn would also work, but it seems like a major setback if cuDNN cannot be used.

This recent change might have helped, check it out: https://github.com/fchollet/keras/blob/master/keras/layers/convolutional.py#L152-L164

dhammack commented 9 years ago

I had this problem and solved it by getting the latest Theano from github. I am using cuDNN v3 with a Titan X on Win8.

flyboys3000 commented 9 years ago

Great, upgrading to the GitHub Theano solved the problem.

fchollet commented 9 years ago

So, everyone confirms that this issue is definitely solved with the latest Theano and latest Keras?

w0nk0 commented 9 years ago

I had gotten bleeding-edge Keras and Theano two weeks ago, to no avail; I haven't tried since.


rodrigob commented 9 years ago

I am using Keras 2c30d503eada5cb5429b6f6d8ced0e996760e40e (GitHub as of August 27th) and Theano 0.7.0.dev-856aa0b6d3454ff1b4d00575e1ec38f27aedb7d9.

rodrigob commented 9 years ago

Just tried with Keras head 332d43e023073561fec53828ee21e206ac1b34b1 and Theano head 0.7.0.dev-dc13bfcaa165b0d2d24ec509944da9f29114470b: running python3.4 mnist_cnn.py on a Tesla K40m prints a NaN loss. When using the CPU configuration, it works fine.

rodrigob commented 9 years ago

Just for fun, I also tried with python2.7 (instead of python3.4), and there I get the same NaN behaviour.

apcode commented 9 years ago

Just adding some more evidence for this.

I am using an iMac with a GTX 980M and was seeing the same problem. I was running examples/addition_rnn.py and it never converged on the GPU but did on the CPU.

Running with optimizer_excluding=cudnn fixes the issue, e.g.:

THEANO_FLAGS=device=gpu,floatX=float32,cuda.root=/usr/local/cuda,optimizer_excluding=cudnn python examples/addition_rnn.py

I also updated to the bleeding edge theano as suggested on this thread: pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

And this now works fine, i.e. the originally failing invocation now works correctly, as reported by others in this thread:

THEANO_FLAGS=device=gpu,floatX=float32,cuda.root=/usr/local/cuda python examples/addition_rnn.py

fchollet commented 9 years ago

Excellent. We'll consider this resolved then.

I believe the NaN loss on GPU reported by @rodrigob is a different issue (which we also encountered in the past). We'll resolve it separately.

andykitchen commented 9 years ago

Hi all, I just ran into this bug (or a very similar one) running 0.1.3; updating to the current master (c18a9cd405f29040ebb259aef74963c4b2134494) has fixed the problem for me.

jiehanwang commented 8 years ago

Hi all. I fixed the problem by replacing the ReLU activation with softplus. Maybe you can try it. I use the bleeding-edge version of Theano and the latest version of Keras.
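
(For anyone wanting to try this, a minimal sketch of the swap using the old Keras 0.x Sequential API with its Dense(input_dim, output_dim) signature; the layer sizes here are placeholders:)

```python
from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()
model.add(Dense(784, 128))           # Keras 0.x signature: Dense(input_dim, output_dim)
model.add(Activation('softplus'))    # instead of Activation('relu')
model.add(Dense(128, 10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
```

Softplus, log(1 + e^x), is a smooth approximation of ReLU, which may be why it sidesteps the numerical issue here.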

lmoesch commented 8 years ago

The issue still remains, even with the cuDNN optimisation disabled:

Code: mnist_cnn.py with SGD
Graphics card: GTX 980 Ti

.theanorc:

[global]
floatX=float32
optimizer_excluding=cudnn

[lib]
cnmem=0.7

[nvcc]
fastmath=True

Using bleeding-edge Theano and Keras, as well as Keras (20dc637).
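
(For reference, the same .theanorc configuration translated to environment flags, assuming the standard section.option flag syntax; this is a direct translation of the file above:)

THEANO_FLAGS='floatX=float32,optimizer_excluding=cudnn,lib.cnmem=0.7,nvcc.fastmath=True' python mnist_cnn.py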