keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Many tests failed, getting loss=nan on GPU mode #984

Closed pvnick closed 7 years ago

pvnick commented 8 years ago

All of the tests pass in CPU mode, but many fail in GPU mode, and when I train a network (e.g. with any of the scripts in the examples folder) the loss immediately goes to NaN. My GPU is a GeForce GTX 980 Ti. I am using the most recent code in the Theano and Keras repositories as of this morning.

Relevant test results:

auto/test_shape_inference.py FFFFFFFFFFFFFFF
auto/test_tasks.py ......
auto/keras/test_activations.py FFF.
auto/keras/test_constraints.py .....
auto/keras/test_normalization.py .......
auto/keras/layers/test_convolutional.py FF.....
auto/keras/layers/test_core.py ................FFF
auto/keras/layers/test_recurrent.py FFFFFFF

Most failures seem to be associated with the following error message:

E ValueError: When compiling the inner function of scan the following error has been encountered: The initial state ('outputs_info' in scan nomenclature) of variable IncSubtensor{Set;:int64:}.0 (argument number 4) has dtype float32, while the result of the inner function ('fn') has dtype float64. This can happen if the inner function of scan results in an upcast or downcast.

Here is my .theanorc file:

floatX = float32
device = gpu0

[nvcc]
fastmath = True

Running deviceQuery in the cuda samples folder shows that the test passed.

Not sure where to go from this point :(

fchollet commented 8 years ago

The error message is clear, but it would theoretically be impossible for the internal scan loop of recurrent layers to return anything other than Theano floatX. So the cause is mysterious. Also as far as I can tell everything is looking fine on EC2 GPUs, so the problem might be specific to your GPU architecture (who knows).

You can try the following: cast all return values of the _step functions in keras.layers.recurrent to floatX, with:

val = T.cast(val, theano.config.floatX)

It's a trivial change. Let us know if it fixes your problem. It's worth trying.
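To illustrate what the suggested cast does, here is a minimal sketch that uses NumPy as a stand-in for Theano, so it runs without a GPU setup. It is not the Keras recurrent code: `step`, `FLOATX`, `state` and `weights` are hypothetical names, and `astype` plays the role of `T.cast(val, theano.config.floatX)`.

```python
import numpy as np

FLOATX = "float32"  # hypothetical stand-in for theano.config.floatX


def step(state, weights):
    # Mixing a float32 state with float64 weights upcasts the result to float64.
    return state * weights


state = np.ones(3, dtype=FLOATX)
weights = np.full(3, 0.5)  # NumPy defaults to float64

raw = step(state, weights)
fixed = raw.astype(FLOATX)  # NumPy analogue of T.cast(val, floatX)

print(raw.dtype, fixed.dtype)
```

The first dtype is the upcast result that a float32 initial state would disagree with; the second is what the cast restores.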

farizrahman4u commented 8 years ago

That is weird by the way. Let us know if you solve it somehow.

pvnick commented 8 years ago

Hmm, that seems to fix many of the issues, but the following tests are still failing:

auto/test_shape_inference.py FFFFFFFFFFFFFFF
auto/test_tasks.py ......
auto/keras/test_activations.py FFF.
auto/keras/test_constraints.py .....
auto/keras/test_normalization.py .......
auto/keras/layers/test_convolutional.py FF.....
auto/keras/layers/test_core.py ................FFF

With error messages that are very similar:

E TypeError: ('Bad input argument to theano function with name "/home/paul/keras/tests/auto/test_shape_inference.py:16" at index 0(0-based)', 'TensorType(float32, 3D) cannot store a value of dtype float64 without risking loss of precision. If you do not mind this loss, you can: 1) explicitly cast your data to float32, or 2) set "allow_input_downcast=True" when calling "function".',

E NotImplementedError: The image and the kernel must have the same type.inputs(float64), kerns(float32)

../../ipynb/local/lib/python2.7/site-packages/theano/tensor/nnet/conv.py:646: NotImplementedError
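For readers wondering why a float64 value is rejected by a float32 tensor, the following sketch shows the same idea with NumPy's casting rules. It is only an analogy for the TypeError above, not Theano's own check; the variable names are made up for illustration.

```python
import numpy as np

# Safe casting forbids float64 -> float32 because precision can be lost.
safe = np.can_cast(np.float64, np.float32, casting="safe")
# "same_kind" casting permits it: an explicit opt-in to the downcast.
same_kind = np.can_cast(np.float64, np.float32, casting="same_kind")

value = 0.1  # Python floats are float64
roundtrip = float(np.float32(value))  # nearest float32, read back as float64

print(safe, same_kind, roundtrip == value)
```

The round trip through float32 does not return the original value, which is the loss of precision the error message warns about.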

fchollet commented 8 years ago

you can: 1) explicitly cast your data to float32, or 2) set "allow_input_downcast=True" when calling "function".'

Very helpful, but all functions manipulated in Keras specify allow_input_downcast=True (see keras.models), for this very reason. So that's why this issue really isn't supposed to be happening.

fchollet commented 8 years ago

Try specifying floatX=float32 via command line. It's possible your .theanorc isn't being picked up.
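One way to do this from Python is to set the `THEANO_FLAGS` environment variable before Theano is imported. This is a sketch: the flag values are copied from the .theanorc shown earlier in the thread.

```python
import os

# Must run before the first `import theano`: Theano reads THEANO_FLAGS at import time.
os.environ["THEANO_FLAGS"] = "floatX=float32,device=gpu0"

print(os.environ["THEANO_FLAGS"])
```

The shell equivalent is to prefix the command, e.g. `THEANO_FLAGS=floatX=float32,device=gpu0 python script.py`, where `script.py` is a placeholder. After importing Theano, `theano.config.floatX` shows which value was picked up.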

fchollet commented 8 years ago

Very helpful, but all functions manipulated in Keras

That's true in the Keras codebase itself, but we may be instantiating custom functions in the tests. Mind checking if the failures can be linked to custom Theano functions in the tests?

fchollet commented 8 years ago

At first glance that seems to be true for at least some of the failing tests, maybe all. In that case a fix would be to add allow_input_downcast=True every time a Theano function gets instantiated in a test.

pvnick commented 8 years ago

I tried setting the floatX parameter on the command line, downgrading both Keras and Theano to their week-old code bases, and restarting the server on which things are running. None of these seemed to fix the issue :-/

pvnick commented 8 years ago

For the unit test failures, that is. The recurrent network seems to be working for normal usage.

pvnick commented 8 years ago

Btw, I installed everything in a virtual environment, not globally.

fchollet commented 8 years ago

I am pretty sure the tests can be fixed by adding allow_input_downcast=True every time a Theano function gets instantiated in a test, as per my comment above. Try that.

pvnick commented 8 years ago

This does indeed appear to fix the issue. Would you like me to submit a pull request for the changes?