
Mean Square Error optimizer returns NaNs #2530

Closed: ArEnSc closed this issue 7 years ago

ArEnSc commented 8 years ago

Hi, I am trying to implement the Super-Resolution CNN (SRCNN) by Microsoft: http://research.microsoft.com/en-us/um/people/kahe/publications/eccv14srcnn.pdf

I am using Theano==0.8.1 and Keras==1.0.1.

    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Activation, Flatten
    from keras.layers import Convolution2D, MaxPooling2D
    from keras.optimizers import SGD

    def build_model():
        cnn_model = Sequential()
        cnn_model.add(Convolution2D(64, 9, 9, border_mode='valid', init='glorot_normal', input_shape=(3, 32, 32)))
        cnn_model.add(Activation('relu'))
        cnn_model.add(Convolution2D(32, 1, 1, border_mode='valid', init='glorot_normal'))
        cnn_model.add(Activation('relu'))
        cnn_model.add(Convolution2D(3, 5, 5, border_mode='valid', init='glorot_normal'))

        # 0.0001 10^-4 | 0.00001 10^-5 last layer
        sgd = SGD(lr=10e-4)
        cnn_model.compile(loss='mean_squared_error', optimizer='sgd')

        print "Model Build"
        return cnn_model

Training the model results in a loss of NaN. I have changed the optimizer to rmsprop and the loss to mean_squared_logarithmic_error, and it appears to work now.

carlthome commented 8 years ago

If it seems to work, feel free to close the issue.

ArEnSc commented 8 years ago

@carlthome 'mean_squared_error' with sgd does not work.

mean_squared_logarithmic_error with rmsprop works well.

carlthome commented 8 years ago

@ArEnSc If you get NaNs with a particular optimizer and loss function, a lot of things could have gone wrong, but it's not necessarily a bug; it depends on your data, and sometimes a model simply will not converge to a solution. Typically SGD would not diverge but just take longer, yet in this particular case the initial learning rate is higher than RMSProp's (comparing Keras' default values, I think). With SGD you use the same learning rate for every parameter; with RMSProp you start from a lower learning rate (Keras' default anyway) and it is scaled per parameter as training progresses.
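To make the learning-rate comparison concrete, here is a sketch (assuming Keras 1.x defaults of SGD(lr=0.01) and RMSprop(lr=0.001), plus the optional clipnorm argument the optimizers accept). Note also that in Python, 10e-4 is 1e-3, i.e. 10^-3, so the SGD(lr=10e-4) in the original snippet is 10x the paper's 10^-4:

    from keras.optimizers import SGD, RMSprop

    print(10e-4)  # 0.001, i.e. 10^-3, not 10^-4

    # paper's first-layer rate, plus gradient-norm clipping as a divergence guard
    sgd = SGD(lr=1e-4, clipnorm=1.0)
    # per-parameter scaling out of the box, starting from a lower default rate
    rmsprop = RMSprop()

    # cnn_model.compile(loss='mean_squared_error', optimizer=sgd)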

So it's entirely possible that you will overshoot in one direction and never actually reach a minimum, with the loss function looking like the left side of the attached image after some weight updates.

This is a pretty neat read: http://sebastianruder.com/optimizing-gradient-descent/

Then again, this might just be regular underflow if you use float16 (or maybe even float32) with sufficiently small values. Consider trying float64 and see if the NaNs go away.
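One way to try that is by switching Keras' default float type; a minimal sketch (assuming keras.backend.set_floatx as in Keras 1.x; the "floatx" entry in ~/.keras/keras.json does the same thing persistently):

    from keras import backend as K

    # must run before any layers are built; existing tensors keep their dtype
    K.set_floatx('float64')
    print(K.floatx())  # -> 'float64'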

ArEnSc commented 8 years ago

@carlthome Hey, I have tried setting it to float64 and I also lowered the learning rate; I tried both, and it still causes a NaN loss and a 14.46% accuracy? :S Thanks for the article, but I think MSE could be broken?

ArEnSc commented 8 years ago

The paper states that it requires a learning rate of 10e-5 (i.e. 10^-4) for the first two layers, while the last layer should use 10e-6 (i.e. 10^-5).

    def build_model():
        cnn_model = Sequential()
        cnn_model.add(Convolution2D(64, 9, 9, border_mode='valid', init='glorot_normal', input_shape=(3, 32, 32)))
        cnn_model.add(Activation('relu'))
        cnn_model.add(Convolution2D(32, 1, 1, border_mode='valid', init='glorot_normal'))
        cnn_model.add(Activation('relu'))
        cnn_model.add(Convolution2D(3, 5, 5, border_mode='valid', init='glorot_normal'))

        # 0.0001 10^-4 | 0.00001 10^-5 last layer
        sgd = SGD(lr=10e-6)
        cnn_model.compile(loss='mean_squared_error', optimizer='sgd', metrics=["accuracy"])

        print "Model Build"
        return cnn_model

Epoch 1/7000
4525/4525 [==============================] - 4s - loss: nan - acc: 0.1469     
Epoch 2/7000
4525/4525 [==============================] - 4s - loss: nan - acc: 0.1468     
Epoch 3/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 4/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 5/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 6/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 7/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 8/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 9/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 10/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 11/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 12/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 13/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 14/7000
4525/4525 [==============================] - 3s - loss: nan - acc: 0.1468     
Epoch 15/7000
4525/4525 [==============================] - 4s - loss: nan - acc: 0.1468     
Epoch 16/7000
4525/4525 [==============================] - 4s - loss: nan - acc: 0.1468     
Epoch 17/7000
4525/4525 [==============================] - 4s - loss: nan - acc: 0.1468     
Epoch 18/7000
1216/4525 [=======>......................] - ETA: 3s - loss: nan - acc: 0.1483

Could it be how I initialize the weight matrices? I did try the default setting and then glorot_normal.
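For what it's worth, a sketch of swapping the initialization (assuming the Keras 1.x init strings, where the default is 'glorot_uniform' and 'he_normal' is the variance scaling usually suggested for ReLU stacks):

    from keras.models import Sequential
    from keras.layers import Convolution2D, Activation

    model = Sequential()
    # 'he_normal' instead of 'glorot_normal'; scales weight variance for ReLU units
    model.add(Convolution2D(64, 9, 9, border_mode='valid', init='he_normal', input_shape=(3, 32, 32)))
    model.add(Activation('relu'))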

ArEnSc commented 8 years ago

@carlthome Hey, I just tried this with sgd and msle instead and it works... mse still doesn't work, oddly. Perhaps someone should give it a check?

carlthome commented 8 years ago

The mse implementation is not broken. Note the working examples in the repo.
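A quick way to convince yourself is a toy regression; a minimal sketch (Keras 1.x API; the data and model here are made up for illustration):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    # trivial linear problem: mse with SGD should drive the loss toward 0, not NaN
    X = np.random.rand(256, 4)
    y = X.sum(axis=1, keepdims=True)

    model = Sequential()
    model.add(Dense(1, input_dim=4))
    model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.1))
    history = model.fit(X, y, nb_epoch=20, batch_size=32, verbose=0)
    print(history.history['loss'][-1])  # small and finite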

By the way, you're not using your sgd object because you're calling compile with a 'sgd' string...

    sgd = SGD(lr=10e-6)
    cnn_model.compile(loss='mean_squared_error', optimizer='sgd', metrics=["accuracy"])
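The corrected call passes the object itself (a minimal sketch of the fix):

    sgd = SGD(lr=10e-6)
    # pass the SGD instance, not the string 'sgd', so the custom lr is used
    cnn_model.compile(loss='mean_squared_error', optimizer=sgd, metrics=["accuracy"])
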
ArEnSc commented 8 years ago

@carlthome I did not see that... thanks! But it didn't make a difference...

cerisara commented 8 years ago

I guess the 'overshooting' situation can be checked by plotting the gradient norm, but how can we check whether NaNs are due to "regular underflow"?

I agree this might not be a bug, but a nice enhancement would be to have safeguards in the code for such situations, or some informative warning messages.

Or maybe also add a section to the FAQ giving hints on how to debug the situation? Thanks anyway...
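In the meantime, that kind of safeguard can be approximated with a custom callback; a minimal sketch (assuming the Keras 1.x Callback API and that fit honors model.stop_training, as EarlyStopping does):

    import numpy as np
    from keras.callbacks import Callback

    class NaNGuard(Callback):
        """Stop training with a message as soon as the loss stops being finite."""
        def on_batch_end(self, batch, logs={}):
            loss = logs.get('loss')
            if loss is not None and not np.isfinite(loss):
                print('Non-finite loss %r at batch %d; stopping training.' % (loss, batch))
                self.model.stop_training = True

    # usage: model.fit(X, y, callbacks=[NaNGuard()])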

(I'm actually also running into NaN or Inf issues with several optimizers; very hard to debug...

post-edit: Forget about it, mea culpa! I actually had one NaN label somewhere. So nothing to change in Keras, which is definitely perfect as it is!! :-) )
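For the record, a quick data check would have caught that stray label; a sketch (assuming the training inputs and labels are NumPy arrays X and y):

    import numpy as np

    # fail fast if anything non-finite sneaks into the data before training
    assert np.isfinite(X).all(), 'non-finite values in inputs'
    assert np.isfinite(y).all(), 'non-finite values in labels'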

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.