keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Loss turns into 'nan' when running on GPU #1244

Closed lmoesch closed 3 years ago

lmoesch commented 8 years ago

As previously stated in issue #511, Keras runs into 'not a number' losses while training on the GPU. I tested this with the mnist_cnn example code as well as with self-designed conv networks. I also tried disabling cuDNN, as well as increasing the epsilon and setting a clipnorm. Nothing solved the problem.

I'm using the latest versions of Theano and Keras, and SGD optimisation with categorical crossentropy.

Graphics: GTX 980 Ti

fchollet commented 8 years ago

I'd like to identify what op is causing this issue.

lmoesch commented 8 years ago

Here is the network part of my code. I'll try other loss functions, but they take some time to provide useful evidence, since you can't predict when the loss will turn into 'nan'.

# Imports added for completeness (this code uses the old Keras 0.x/1.x API):
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.regularizers import l1l2
from keras.optimizers import SGD
from keras.utils import np_utils

img_rows = img_cols = 128
img_channels = 3
l1 = l2 = 0

# convert data for GPU use
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")
X_train /= 255
X_test /= 255

# convert class vectors to binary class matrices
y_train = np_utils.to_categorical(y_train, nb_classes)
y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()

model.add(Convolution2D(16, 5, 5, border_mode='same',
                        input_shape=(img_channels, img_rows, img_cols), W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(16, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Convolution2D(32, 3, 3, border_mode='valid', W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Convolution2D(64, 3, 3, border_mode='valid', W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Flatten())
model.add(Dense(1024, W_regularizer=l1l2(l1, l2)))
model.add(Activation('relu'))
model.add(Dropout(0.6))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)

model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,shuffle=True, show_accuracy=True, callbacks=[history])

hr0nix commented 8 years ago

As far as I know, it's the combination of relu and softmax that causes numerical troubles, as relu can produce large positive values corresponding to very small probabilities. If you change your model to use, say, tanh instead of relu for the last dense layer, the problem will go away.
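
A minimal sketch (plain NumPy, not tied to any model in this thread) of the failure mode described here: large positive logits overflow exp() in float32 unless the softmax subtracts the maximum first.

import numpy as np

logits = np.array([0.0, 50.0, 100.0], dtype=np.float32)  # e.g. large relu outputs

# Naive softmax: exp(100) overflows float32, so the sum is inf and inf/inf = nan
naive = np.exp(logits) / np.sum(np.exp(logits))
print(naive)   # [ 0.  0.  nan]

# Numerically stable softmax: shift by the max before exponentiating
shifted = logits - logits.max()
stable = np.exp(shifted) / np.sum(np.exp(shifted))
print(stable)  # [~0.  ~2e-22  ~1.]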

lmoesch commented 8 years ago

I first tested the 'tanh' activation, which didn't help. It was no surprise though, since this problem is specific to calculations on the GPU and not a general one of numerical stability.

I also tried mse as the loss function, which ran into 'nan' as well.

fchollet commented 8 years ago

I also tried mse as the loss function, which ran into 'nan' as well.

In that case the overflow is happening earlier in the graph.

Next, you could try removing the regularizers.

BTW the history callback is included by default, no need to specify it manually.

lmoesch commented 8 years ago

Correct me if I'm wrong, but with l1 = l2 = 0 it should not matter that the l1l2 regularizer is defined?

But I'll try to remove them.

fchollet commented 8 years ago

Of course, it should not matter. Also, there should not be a float32 overflow.

lmoesch commented 8 years ago

Okay, I removed all W_regularizers and the 'nan' loss still occurs.

I noticed that the loss is more likely to turn into 'nan' when using deeper (e.g. 8 conv layers instead of 6) and wider (e.g. 512 feature maps) networks.

fariasfc commented 8 years ago

I am also having a problem with nans in the loss.

I realized that the weights became nan, but I don't know if this happened before or after the loss calculation. Which is very strange, since the values used to calculate the crossentropy are clipped before applying the objective function...

qqgeogor commented 8 years ago

The same thing happened to me, except I was using Keras to build a regression model. I have tried different losses (rmse or mae) and also sigmoid and tanh apart from relu. Nothing helps to improve this case.

fariasfc commented 8 years ago

I think I've fixed it on the PR #1368. @fchollet what do you think?

lmoesch commented 8 years ago

I get your point on preventing division by zero, but this doesn't explain why this problem is specific to certain GPUs, especially the GTX 9XX series. (I never had a problem on my GTX 670.)

tylerklement commented 8 years ago

I'm getting the same problem. I think this is a problem with the system configuration more so than with the code. My code used to work, but then I had to reformat my computer and reinstall everything, and now I'm getting "nan" loss. So I think it's something with the configuration of Theano, CUDA, Visual Studio, or CuDNN, at least in my case. Still trying to figure it out.

azzever commented 8 years ago

I'm also getting this problem (Ubuntu 14.04, GTX 980Ti/970, Theano as backend, CNN with residual units, ReLU, BN, mse/mae loss).

In my case the problem occurs randomly; the probability of getting nan increases with the model's complexity (and memory usage). When the loss becomes nan, loading saved weights doesn't help to continue training (the weights become corrupted on the first training iteration). Only recompiling or creating a new model allows training to continue.

tylerklement commented 8 years ago

It works for me now. I had installed cuDNN incorrectly - previously I had just dragged the cuDNN files and dropped them in the CUDA folder, replacing anything with the same name. So I re-installed Visual Studio (2013), Anaconda, Theano, and Keras. It still gave me "nan". So then, I installed cuDNN, but this time, I did this by extracting the cuDNN files to their own directory, and then just added that directory to my path. I think that was the key factor for me: installing cuDNN (properly). I was using relu and adam the whole time.

djstrong commented 8 years ago

The same problem on a Tesla M2090. I tried consume_less with both gpu and cpu. GRU is working OK.

WenchenLi commented 8 years ago

Anybody any progress on this issue?

9thDimension commented 8 years ago

I had this problem - nets that worked perfectly fine on various CPU hardware failed to train on AWS GPU-enabled remote machine.

I removed Theano 0.8.0, and upgraded to the bleeding-edge version from GitHub (which is 0.9.0-dev2). Now training works perfectly.

Can't blame this one on Keras, folks!

djstrong commented 8 years ago

On CPU I am getting nans too, but after more epochs than on GPU.

ersinyar commented 8 years ago

I have the same problem. I train an LSTM network with my own data. The train loss suddenly becomes NaN. I checked my code with the imdb dataset and it works OK, but when I switch to my dataset the nan problem occurs. I preprocessed my data in the same way the imdb dataset is preprocessed in the imdb_lstm example of Keras. I do not understand what the problem is. It seems that the network configuration is OK, since it runs with another dataset. However, my dataset and the imdb dataset are both text; how can another text dataset cause this issue? I tried gradient clipping as well as weight norm limitations. I think the sudden change happens when an inf value is produced by the categorical_crossentropy function, such as log(0). But how can I detect and avoid this problem?

eyal-str commented 8 years ago

I also had this problem. I fixed it when I changed the Y values to float numbers, for example 0.0 and 1.0 instead of 0 and 1.
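
For reference, a minimal sketch of that conversion (the variable name is illustrative, not from the poster's code):

import numpy as np

y_train = np.array([0, 1, 1, 0])       # integer labels
y_train = y_train.astype("float32")    # 0.0 / 1.0, matching the model's float32 outputs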

ghost commented 8 years ago

Like @9thDimension said, upgrading Theano to the bleeding-edge version (0.9.0-dev2) seems to have fixed the nan issues for me so far on Debian wheezy. I'm using a Python 3.5.2 env in Anaconda 4.1.1.

I just followed the instructions from the Theano website: pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

skerit commented 8 years ago

I'm training on the CPU and using tensorflow as backend, also getting the nan issue.

svolchkov commented 8 years ago

Was having the same issue with a regression task. Upgrading Theano didn't work but changing the optimizer from 'sgd' to 'rmsprop' seemed to help.

patrick-ogrady commented 8 years ago

@skerit did you figure it out?

patricio-astudillo commented 8 years ago

I had the nan problem as well and I solved it by changing the floatx value in ~/.keras/keras.json from float32 to float64 (tested on GPU).

This is the description of my setup: Backend: tensorflow and theano; Optimizer: Adam; GPU: Titan X and GTX 970; Activations: RELU; Last layer activation: sigmoid; Objective: binary cross entropy.

If more details are needed, let me know.

EDIT: the problem was not solved by this, but training lasted longer, so an acceptable loss value was reached. EDIT 2: after re-reading the images and saving them, the training lasted even longer.

nouiz commented 8 years ago

Did you time it? If you use the current GPU backend, this causes all computation to run on the CPU.

farizrahman4u commented 8 years ago

I was having the same issue. Disabled cuDNN (optimizer_exclude=cudnn); everything works fine. And slow.

jphalip commented 7 years ago

I too ran into a similar issue where the loss and layer weights would suddenly be set to nan during training with floatx as float32 (it worked fine with float64 but was much slower).

I was able to fix this by applying either the clipnorm or clipvalue optimizer attributes (https://keras.io/optimizers/#parameters-common-to-all-keras-optimizers). It seems that for me this was a case of exploding gradients, which may not be true for all cases reported here. I just thought I'd mention what worked for me in case that's helpful to others.
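
For anyone reading later, this is roughly what that looks like; a sketch using the Keras 1.x-era API from this thread (the clipping values are illustrative, and model is assumed to be an already-built Keras model):

from keras.optimizers import SGD

# clipnorm caps each gradient tensor's L2 norm; clipvalue caps each component element-wise
sgd = SGD(lr=0.01, momentum=0.9, nesterov=True, clipnorm=1.0, clipvalue=0.5)
model.compile(loss='categorical_crossentropy', optimizer=sgd)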

svolchkov commented 7 years ago

I used clipnorm, too, and it allowed me to use the adam optimizer. I wonder if using clipnorm might have a negative impact on the accuracy.

ghost commented 7 years ago

I know that clipnorm fixes this issue, and I know that clipnorm clips large gradient values, but I want to know why the nan is produced. Why do I see loss=nan when I don't clip the gradients?

sergsb commented 7 years ago

I have the same issue during training of a 3D convolutional network on GPU. I use float32 and the Theano backend.

monajalal commented 7 years ago

I have the following in ~/.keras/keras.json:

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "image_dim_ordering": "tf",
    "backend": "tensorflow"
}

and I got nan:

mona@pascal:~/computer_vision/VPilot$ python train.py 
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:1938: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
  warnings.warn('\n'.join(msg))
Epoch 1/1000
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:03:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x4750d80
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:83:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1:   N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40c, pci bus id: 0000:83:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 4777 get requests, put_count=3270 evicted_count=1000 eviction_rate=0.30581 and unsatisfied allocation rate=0.54574
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110
    4/70629 [..............................] - ETA: 364851s - loss: 0.5890I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 755 get requests, put_count=1771 evicted_count=1000 eviction_rate=0.564653 and unsatisfied allocation rate=0
    8/70629 [..............................] - ETA: 194931s - loss: 0.5553I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 247 get requests, put_count=1270 evicted_count=1000 eviction_rate=0.787402 and unsatisfied allocation rate=0
   13/70629 [..............................] - ETA: 129454s - loss: 0.5582I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5071 get requests, put_count=4961 evicted_count=2000 eviction_rate=0.403145 and unsatisfied allocation rate=0.423979
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 449 to 493
   18/70629 [..............................] - ETA: 100341s - loss: 0.5194I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5145 get requests, put_count=5327 evicted_count=2000 eviction_rate=0.375446 and unsatisfied allocation rate=0.365986
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 720 to 792
   25/70629 [..............................] - ETA: 79355s - loss: 0.5875I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5137 get requests, put_count=5388 evicted_count=1000 eviction_rate=0.185598 and unsatisfied allocation rate=0.175784
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 1694 to 1863
70629/70629 [==============================] - 25358s - loss: nan - val_loss: nan
Epoch 2/1000
70629/70629 [==============================] - 24899s - loss: nan - val_loss: nan
Epoch 3/1000
70629/70629 [==============================] - 24967s - loss: nan - val_loss: nan
Epoch 4/1000
70629/70629 [==============================] - 24987s - loss: nan - val_loss: nan
Epoch 5/1000
70629/70629 [==============================] - 24855s - loss: nan - val_loss: nan
Epoch 6/1000
70629/70629 [==============================] - 24977s - loss: nan - val_loss: nan
univ12 commented 7 years ago

I have this problem on keras 1.2.1 and theano 0.9.0b1. My epochs are already starting with nan. Adding a clipvalue=1, changing the learning rate and trying different optimizers did not help.

vwrs commented 7 years ago

I also have the same issue training an LSTM network with multi_gpu.py, using mse as the loss function.

vvpreetham commented 7 years ago

I get NaN for a linear regressor at the time of model.evaluate with Adam or a TF-backend FTRL optimizer. I have tried changing the parameter size of the NN architecture and played around with learning rates, regularizers, clipping, etc. No luck. I am running on 3 Tesla-X GPUs.

(BTW, it happens only when I allocate more than 1 GPU.)

jmaronas commented 7 years ago

I will post my experience and my solution. One key thing about saturation is that it really has nothing to do with the cost or the parameters, but with the updates. The softmax function has some tricks for preventing overflow when the previous layer is a ReLU that can output high values. I'm fairly sure Theano implements these tricks, but I have not checked.

So normally deactivating cuDNN can solve the problem. I experienced this problem on a classification convolutional neural network and on an MSE fully connected neural network, even with good parameter initialization and data normalization (I did not initialize the weights with values on the order of 10³, for example). First deactivate cuDNN, as it makes approximations.

Then, with the same code, I experienced saturation after changing the Theano version: with one Theano version it does not saturate and with the other it does. Moreover, depending on the GPU I also see saturation; on a GTX 1070 I have more saturation than on a GTX 1080. Hopefully with the new Theano back-end we will have float64 support, but for the moment it does not seem to happen.

So finally the way I solved this is by scaling the cost function. Saturation sometimes happens because in an early layer the derivative with respect to a weight is a sum over the mini-batch (and the earlier the layer, the more summations contribute). Lots of sums can produce high values that end up making a saturated update. Since scaling is a monotonic transformation, it does not change the location of the optimum. Simply take your cost and scale it by, for example, 0.00001. This solved my problem.

Note that my sum of squared errors was normalized by the batch size and by my factor. Hope this helps.
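
A sketch of the loss scaling described above (the 0.00001 factor is the one mentioned in the comment; the wrapper name and the use of Keras's built-in MSE are mine, and model is assumed to exist):

from keras.objectives import mean_squared_error  # keras.losses in Keras 2

def scaled_mse(y_true, y_pred):
    # Multiplying the cost by a constant shrinks every gradient/update by the
    # same factor but does not move the optimum, since scaling is monotonic.
    return 0.00001 * mean_squared_error(y_true, y_pred)

model.compile(loss=scaled_mse, optimizer='sgd')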

vvpreetham commented 7 years ago

An update on the bug (I run this on Tesla-X GPUs).

I consistently get the error when I use sample_weights. The model has a sparse input size of about 8000 neurons and the first layer is an SQRT reduction of that size.

with tf.device('/cpu:0'):
    width = wide_array_width(wide_col_len_dict)
    reduction = wide_reduce(width)
    model = Sequential()
    model.add(Dense(reduction, input_dim=width, activation='softplus'))
    if(middle_layer):
        model.add(Dense(wide_reduce(reduction), activation='softplus', W_constraint=maxnorm(2)))
    # final_layer
    model.add(Dense(1, init='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(lr=0.001,
                                                  beta_1=0.9,
                                                  beta_2=0.999,
                                                  epsilon=10e-04,
                                                  decay=0.0,
                                                  clipnorm=1.0,
                                                  clipvalue=0.3))

The model trains if I comment out the sample_weight argument (but then the model trains horribly wrong):

    hist = model.fit(input_dense_matrix, 
                     labels, 
                     nb_epoch=train_steps, 
                     verbose=0, 
                     shuffle=True,
                     validation_split=0.2,
                     batch_size=60,
                     sample_weight=sample_weights_,
                     callbacks=[early_stopping, checkpointer])

ghost commented 7 years ago

I am having the same issue for this network:

# imports added for completeness
from keras.layers import Input, Dense
from keras.models import Model

class FeedForward:

    def __init__(self, input_dim, nb_classes):

        in_x = Input(shape=(input_dim, ), name='in_x')
        h1 = Dense(14, name='h1', activation='tanh')(in_x)
        h2 = Dense(8, name='h2', activation='tanh')(h1)
        out = Dense(nb_classes, name='out', activation='tanh')(h2)

        self.model = Model(input=[in_x], output=[out])

    def compile_model(self, optimizer='adam', loss='mse'):
        self.model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

The loss will always be nan unless I wrap everything in with tf.device('/cpu:0'): and run the calculations on the CPU.
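
For reference, a sketch of that kind of wrapping with the TensorFlow backend (the dimensions, data and training arguments are illustrative; FeedForward is the class defined above, and X_train/y_train are assumed to exist):

import tensorflow as tf

with tf.device('/cpu:0'):
    # build, compile and train entirely under a CPU device scope
    net = FeedForward(input_dim=20, nb_classes=3)
    net.compile_model()
    net.model.fit(X_train, y_train, batch_size=32, nb_epoch=10)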

MClarkTurner commented 7 years ago

I'm having a similar issue with my new Titan X running on TF 1.0.1 using CUDA 8.0 and CuDNN 5.1.10. I have tried clipping the gradients but I've had no luck. My model works fine on CPU but within 100 iterations of mini-batches size 10 I inevitably get NaN values when running on my GPU.

Is it possible this is a problem with my installation of CUDA, CuDNN, or TF? I've tried building TF from source, to no avail. Has anyone had any luck going back to CUDA 7.5 and CuDNN 4?

EDIT: So after a lot of work I found out that this was an error with my code and not with the architecture. Apparently nans can become more prevalent depending on your environment but at its core this seems to be an issue with my model.

rohankshir commented 7 years ago

I am also getting the issue when I add regularization (for an attention layer). I played with kernel regularization and activity regularization and they both result in nans. I get nan on both GPU and CPU training.

GZuin commented 7 years ago

Also having this issue on a GeForce GTX 1060. Training on CPU works OK; on GPU the loss becomes nan after the first batch update.

Tried multiple versions of cuDNN (all of them were 5005 or more recent though), Theano (0.8.something and 0.9.0) and Keras (1.2.something and 2.something). All had the same problem.

Tried disabling cuDNN through .theanorc and scaling my loss by 0.00001. Neither solved the issue (although I was still seeing the cudnn 5005 in the 'Using gpu' message ...).

Things worth noting:

Tried running the same program on a Tesla K40c. Same story.


Tried decreasing the batch_size. So far so good, I haven't seen the nan error yet. My batch size before was 250: it made the first loss calculation (0.0853) and then turned to nan at 500/80000. Now I'm using a batch size of 2 (went to the extreme) and I'm currently at 1000/80000 without any problems. I will try different batch sizes and find the one that works best for me.

Again, 500 and 1000 are only the chunks of data processed; this is all within the first epoch.


Hope this might help people in the future with the same problem I had.

Phylliida commented 7 years ago

I kept having this issue, which was annoying because I would train something overnight and in the morning it was nan. I think I fixed it now; I haven't got nans after about a day of training, but I'll update my comment if I do.

To fix this, you have to do three things:

Add a very minor bias and weight regularizer to every layer

model.add(Dense(hiddenSize, kernel_regularizer=l2(0.00001), bias_regularizer=l2(0.00001)))

This is so small it won't really affect your training, it will just ensure the weights and biases don't get massive

Next I did

optimizer = optimizers.Adam(clipnorm=1., clipvalue=0.5)

As described above. Finally, I am using crossentropy loss so I changed it to this:

import theano.tensor as T  # for T.clip; with the TensorFlow backend use keras.backend.clip instead
from keras import losses

def constrainedCrossEntropy(ytrue, ypred):
  ypred = T.clip(ypred, 0.0001, 0.99999)
  return losses.categorical_crossentropy(ytrue, ypred)

model.compile(loss=constrainedCrossEntropy, optimizer=optimizer)

This ensures the values stay in a reasonable range, because if they get too close to 0 or 1 you will get nans.

Edit: I had the parameters flipped for my constrainedCrossEntropy function, fixed that now

kielnino commented 7 years ago

I also had problems with the train or val loss turning to nan, until I realized that my custom loss function was not capable of handling values bigger than 88 (because exp(89) is too big for float32).

from keras import backend as K

def binary_regression_error(y_true, y_pred):
    return K.mean(K.log(1 + K.exp(K.clip(-y_true*y_pred, -1e40, 88.))))

So clipping solved it for me.

dupsys commented 7 years ago

Hi guys, I don't know what to do anymore. I have tried all the solutions given above but I still experience loss: nan and accuracy: nan, even with a very small batch size of 50. I am using a GeForce GTX 680 with cuDNN version 5105. Below is the output with the TensorFlow backend:

35/50 [====================>.........] - ETA: 13s - loss: 9.4006 - acc: 0.1041 {'acc': 0.10833333, 'loss': 9.4530907, 'batch': 35, 'size': 32}
36/50 [====================>.........] - ETA: 12s - loss: 9.4020 - acc: 0.1043 i:96 {'acc': 0.110625, 'loss': 9.3898754, 'batch': 36, 'size': 32}
37/50 [=====================>........] - ETA: 11s - loss: 9.4017 - acc: 0.1044 {'acc': 0.10916667, 'loss': 9.2677832, 'batch': 37, 'size': 32}
38/50 [=====================>........] - ETA: 10s - loss: 9.3982 - acc: 0.1046 {'acc': 0.11254902, 'loss': 9.3335171, 'batch': 38, 'size': 17}
39/50 [======================>.......] - ETA: 9s - loss: 9.3965 - acc: 0.1048 {'acc': nan, 'loss': nan, 'batch': 39, 'size': 0}
40/50 [=======================>......] - ETA: 8s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 40, 'size': 0}
41/50 [=======================>......] - ETA: 7s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 41, 'size': 0}
42/50 [========================>.....] - ETA: 6s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 42, 'size': 0}
43/50 [========================>.....] - ETA: 5s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 43, 'size': 0}
44/50 [=========================>....] - ETA: 4s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 44, 'size': 0}
45/50 [==========================>...] - ETA: 3s - loss: nan - acc: nan {'acc': nan, 'loss': nan, 'batch': 45, 'size': 0}

I changed the regularisation and customised the loss function as follows:

def constrainedCrossEntropy(x, y):
    x = T.clip(x, 0.0001, 0.99999)
    return losses.categorical_crossentropy(x, y)

Model:

l_conv1 = Conv1D(filters, filter_length=filter_sizes[0], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001),
                 input_shape=(seq_leng, VOCAB_SIZE))(inputs)
l_pool1 = MaxPooling1D(pool_size=pooling_size, padding='same')(l_conv1)
l_conv2 = Conv1D(filters, filter_length=filter_sizes[0], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_pool1)
l_pool2 = MaxPooling1D(pool_size=pooling_size, padding='same')(l_conv2)

l_conv3 = Conv1D(filters, filter_length=filter_sizes[1], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_pool2)
l_conv4 = Conv1D(filters, filter_length=filter_sizes[1], strides=1, padding='same', activation='relu',
                 kernel_regularizer=regularizers.l2(0.00001), bias_regularizer=regularizers.l2(0.00001))(l_conv3)

Please advise me on what to do. Thanks

MaratZakirov commented 7 years ago

@hr0nix As far as I know, it's the combination of relu and softmax that causes numerical troubles, as relu can produce large positive values corresponding to very small probabilities. If you change your model to use, say, tanh instead of relu for the last dense layer, the problem will go away.

Just now I had a problem with the Keras CTC loss function on top of the softmax; I added one more tanh layer before the softmax and the NaNs are gone!

Phylliida commented 7 years ago

@MaratZakirov That's a really great point, though wouldn't a sigmoid with some clipping to make sure it isn't 0 or 1 work better? Tanh can produce negative values, which could give you nans again.

varshini24 commented 7 years ago

I also faced the same issue, with the loss variable showing 'nan' when going deeper into the layers.

But I solved the problem by decaying the learning rate every epoch.
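
For example, a per-epoch decay can be done with a LearningRateScheduler callback; a sketch (the schedule and starting rate are illustrative, and model, X_train, y_train are assumed to exist):

from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    # halve the learning rate every 10 epochs, starting from 0.01
    return 0.01 * (0.5 ** (epoch // 10))

model.fit(X_train, y_train, nb_epoch=50, batch_size=32,
          callbacks=[LearningRateScheduler(step_decay)])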

brunez commented 7 years ago

I haven't looked deep into it, but I think this might have to do with the presence of zeros at some point.

The reason is a workaround I found, which seems pretty robust so far: I just added a layer with very small Gaussian noise after each of my layers. NaNs no more.
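
That workaround can be written with Keras's GaussianNoise layer; a sketch (the layer sizes and the 0.01 stddev are illustrative):

from keras.models import Sequential
from keras.layers import Dense, GaussianNoise  # keras.layers.noise.GaussianNoise on older versions

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=100))
model.add(GaussianNoise(0.01))   # small zero-mean noise, only active at training time
model.add(Dense(10, activation='softmax'))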

sun-peach commented 7 years ago

I also had this problem recently. I have tried loss clipping, weight constraints, and adding a regularizer with a small value. None of them works. I am doing a regression problem and use cuDNN and float64. I use Adam (tried RMSprop, still have this problem).

BTW, I do not control the last layer (the linear layer). Will that be a problem?