keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

NAN loss for regression while training #2134

Closed indra215 closed 3 years ago

indra215 commented 8 years ago

I'm running a regression model on patches of size 32x32 extracted from images against a real value as the target. I have 200,000 samples for training, but during the first epoch itself I'm encountering a NaN loss. Can anyone help me solve this problem, please? I've tried on both GPU and CPU, but the issue still appears.

model = Sequential()

model.add(Convolution2D(50, 7, 7, border_mode='valid', input_shape=(1, 32, 32)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))

model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))

model.add(Dense(1))

sgd = SGD(lr=0.00001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=100)
model.compile(loss='mean_squared_error', optimizer=sgd)

model.fit(X_train, Y_train, batch_size=256, nb_epoch=40)

NasenSpray commented 8 years ago

You can't use softmax with only a single output unit.

indra215 commented 8 years ago

Sorry, that layer is commented out; there is no softmax layer at the output. I've updated the question.

jpeg729 commented 8 years ago

Have you tried reducing the batch size?

I sometimes get loss: nan with my LSTM networks for time-series regression, and I can nearly always avoid it either by reducing the sizes of my layers, or by reducing the batch size.

lhatsk commented 8 years ago

Have you tried using e.g., rmsprop instead of sgd? Usually worked better for me with regression.

the-moliver commented 8 years ago

A few comments. Your l2 regularizers are using pretty large terms; try something much smaller to start, e.g. l2(0.001), or get rid of them altogether to see if that helps. You may be driving your weights to 0 too fast. Your dropout rates are also pretty high; generally people don't go above 0.5. Also, for regression problems I find larger batch sizes (~500) to be more useful.
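
For reference, a minimal sketch of those adjustments applied to the original model, written in current Keras syntax (Conv2D, kernel_regularizer, channels-last input) rather than the Keras 1 API used in this thread; the exact values are illustrative, not tuned:

from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense, Dropout
from keras.regularizers import l2
from keras.optimizers import SGD

model = Sequential()
model.add(Conv2D(50, (7, 7), padding='valid', input_shape=(32, 32, 1)))  # channels-last
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(800, kernel_regularizer=l2(0.001)))  # much weaker than l2(0.5)
model.add(Activation('relu'))
model.add(Dropout(0.5))                              # instead of 0.7
model.add(Dense(800, kernel_regularizer=l2(0.001)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))                                  # linear output for regression

model.compile(loss='mean_squared_error',
              optimizer=SGD(learning_rate=1e-5, momentum=0.9, nesterov=True, clipnorm=100))
# model.fit(X_train, Y_train, batch_size=512, epochs=40)  # larger batch, as suggested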

cjnolet commented 7 years ago

I wanted to point this out so that it's archived for others who may experience this problem in the future. I was running into my loss function suddenly returning NaN after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network, and the shape of the network. I was still getting a loss that eventually turned into NaN and I was getting quite frustrated.

Then it dawned on me: I might have some bad input. It turns out one of the images I was handing to my CNN (and doing mean normalization on) was nothing but zeros. I wasn't checking for this case when I subtracted the mean and normalized by the standard deviation, and thus I ended up with an exemplar matrix that was nothing but NaNs. Once I fixed my normalization function, my network now trains perfectly.
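
A minimal sketch of that kind of guard, assuming per-patch mean/std normalization with NumPy (normalize_patch is a hypothetical helper, not code from this thread):

import numpy as np

def normalize_patch(patch, eps=1e-8):
    # A constant (e.g. all-zero) patch has std == 0; dividing by it produces NaNs.
    std = patch.std()
    if std < eps:
        return patch - patch.mean()
    return (patch - patch.mean()) / std

# Sanity check after preprocessing the whole training set:
# assert np.isfinite(X_train).all(), "preprocessing introduced NaN/inf"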

andrewssdd commented 7 years ago

Sharing my experience for the benefit of others... One thing I found is that the optimizer plays a role in the NaN loss issue. Changing from rmsprop to the adam optimizer made this problem go away for me when training an LSTM.

ylmeng commented 7 years ago

I recall I had such problems when I used the SGD optimizer too, and also with rmsprop. Try adam.

mxh000 commented 7 years ago

For future reference: NaN loss could come from any value in your dataset that is not float or int. In my case, there were some NumPy infinities (np.inf), resulting from divide by zero in my program that prepares the dataset. Checking for inf or nan data first may save you some time spent trying to find faults in the model.
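
A quick sketch of that check with NumPy (np.isfinite catches both NaN and +/-inf; the array names are placeholders):

import numpy as np

def report_non_finite(name, arr):
    bad = ~np.isfinite(arr)
    if bad.any():
        raise ValueError(f"{name} contains {bad.sum()} NaN/inf values")

# report_non_finite("X_train", X_train)
# report_non_finite("Y_train", Y_train)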

naisanza commented 7 years ago

@ctawong I'm using relu for activation, categorical_crossentropy for loss, and adam for optimization and I'm getting nan for the loss value

brittohalloran commented 7 years ago

Optimizer selection was a major factor in my problem as well (image convolution with unbounded output - gradient explosion with SGD). My experience was that RMSprop with heavy regularization was effective in preventing gradient explosion, but that caused training to converge very slowly (many steps / epochs required).

Adam worked with no dropout / regularization and consequently converged very quickly. Whether skipping dropout / regularization is a good idea (since they help prevent over-fitting) is a separate question, but at least now I can determine the proper amount.

unnir commented 6 years ago

My recommendations regarding the issue:

gavriel-merkado commented 6 years ago

I spent literally hours on this problem, going through every possible suggestion. Then I discovered that one column in my data set had the same numerical value throughout, making it effectively a worthless addition to the DNN. I'd recommend anyone go right back to their data and not make any assumptions or take anything for granted.

Mboga commented 6 years ago

I normalized my input data to [0, 1], which solved the loss = nan error.
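
For completeness, a sketch of [0, 1] scaling, assuming 8-bit image data in NumPy arrays (the names are placeholders):

X_train = X_train.astype('float32') / 255.0   # 8-bit images
X_val = X_val.astype('float32') / 255.0

# For generic features, min-max scale with statistics from the training set only:
# lo, hi = X_train.min(axis=0), X_train.max(axis=0)
# X_train = (X_train - lo) / (hi - lo + 1e-8)  # epsilon guards constant columns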

emoen commented 6 years ago

Changing from sgd to rmsprop solved the problem for me (linear regression problem)

eng-tsmith commented 6 years ago

I also had these problems. I tried everything mentioned above and nothing helped.

But now I seem to have found a solution. I am using a fit_generator, and the Keras documentation (fit_generator) mentions that:

...different batches may have different sizes. The last batch of the epoch is commonly smaller than the others...

So I changed my generator to only output batches of the right size. And voilà, since then I don't get NaN and inf anymore.

Not sure if this helps everybody, but I still want to post what helped me.

claycoleman commented 6 years ago

I tried every suggestion on this page and many others to no avail. We were importing CSV files with pandas, then using the Keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing that some CSV files led to NaN while others worked, we looked at the encoding of the files and realized that ASCII files were NOT working with Keras, leading to NaN loss and accuracy of 0.0000e+00; however, UTF-8 and UTF-16 files were working! Breakthrough.

If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. Haven't tried the latter but I'd imagine it would work as well. Hopefully this helps someone very very frustrated!
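
If it helps, a sketch of re-encoding such a file before feeding it to the Tokenizer (pandas assumed; data.csv is a placeholder path):

import pandas as pd

# Read with the encoding reported by `file -i`, then rewrite the file as UTF-8.
df = pd.read_csv("data.csv", encoding="ISO-8859-1")
df.to_csv("data_utf8.csv", index=False, encoding="utf-8")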

AloshkaD commented 6 years ago

I had the loss = nan issue and I solved it by making sure the number of classes in the config and in my dataset are the same. The default num classes was 92+1.

alyato commented 6 years ago

Hi guys, I've run into a weird problem.

Training: 10,000 images; Validation: 2,000 images; nb_classes: 8

Example 1:

base_model = densenet121(weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(8, activation='sigmoid')(x)
model = Model(input=base_model.input, output=predictions)
for layer in base_model.layers:
    layer.trainable = False
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, validation_data=(X_val, Y_val))

When I run this code, the train loss and val loss are NaN. Then I changed the network. Example 2:

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), padding='same', input_shape=(channels, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('sigmoid'))
model.compile(loss=multitask_loss, optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(X_val, Y_val))

But when I run this code, the train loss and val loss are normal. So I think the problem is in the network while fine-tuning it. I want to use the pre-trained DenseNet, so how can I solve the NaN loss? Thanks.

Krithi07 commented 6 years ago

I was getting the loss as NaN in the very first epoch, as soon as training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).

I hope this helps someone encountering a similar problem.

ghost commented 6 years ago

Hi

I put model.add(BatchNormalization()) after the conv layers and it works for me.
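
A minimal sketch of that suggestion in current Keras syntax (the layer sizes and input shape are placeholders):

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 1)))
model.add(BatchNormalization())   # normalizes activations coming out of the conv layer
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))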

yosunpeng commented 6 years ago

In my case, it was the loss function. I was using loss='sparse_categorical_crossentropy' and switched to loss=losses.mean_squared_error (from keras import losses), and the loss returned to normal.

DracoScript commented 6 years ago

I solved my "loss: nan" problem by fixing my annotations. I had used a conversion script for annotations that erroneously changed some bounding box sizes to 0 width or height.

Sorooshi commented 6 years ago

The problem can happen for several reasons; mine was because of the second item: 1) the existence of NaN or null elements in the dataset; 2) a mismatch between the number of classes and the corresponding labels.

rbahumi commented 5 years ago

I experienced the same issue and wanted to share that in my case it wasn't one of the features that had the NaN/inf value; it was actually an infinite Y value.

Hope that will help someone... FYI.

jingzhao3200 commented 5 years ago

I want to share my experience: when I tried to add more data to the LSTM network, the loss became NaN after some time. The solution: I set the batch size to 1 and found the 'bad' data sample that had some zeros causing the loss function not to converge. Hope this helps!

NikhilShaw commented 5 years ago

Make sure you don't have any NaN (or string) values in the dataset. I was having the same problem with regression. If you are using pandas, know that dataframe.replace(np.nan, some_value) doesn't modify the DataFrame that calls it but returns the modified DataFrame. Instead, one should do new_dataframe = dataframe.replace(np.nan, some_value). If the loss is still NaN, refer to this link.

rebeen commented 5 years ago

I faced the same problem using an LSTM. The problem was that my data had some NaN values after standardization, so you should check the model's input data after standardization; you will see whether you have NaN values:

print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))

You can solve this by adding a small value (0.000001) to the std, like this:

mean = np.mean(train, axis=0)
std = np.std(train, axis=0)+0.000001

X_train = (train - mean) / std
X_test = (test - mean) / std

liminn commented 5 years ago

(Quoting @eng-tsmith's comment above about fit_generator and the last, smaller batch of each epoch.)

I am using model.fit_generator() in Keras, and I had the problem that train_loss was normal and decreasing, but val_loss was inf. This was so strange; I checked everything and didn't know why. Finally, I changed the code in my custom data_generator:

def __len__(self):
        #return int(np.ceil(len(self.names) / float(self.batch_size)))
        return int(np.floor(len(self.names) / float(self.batch_size)))  # floor drops the last, partial batch

This means the last (partial) batch is dropped in the data_generator, and this works!

gagantewari commented 5 years ago

In my case, the generator function I was using was not handling the batches properly: instead of returning data as (batch_size, items, height, width, channels), it was returning (0, items, height, width, channels). So I had to fix the logic of my generator to return the data properly.

chmoder commented 5 years ago

I had a categorical field where the column contained only 0. My normalization function turned this column into NaNs, which raised this exception. The solution was to remove the column, since it had only one category.
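
A sketch of that kind of check with pandas (the DataFrame here is just an example):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0]})   # "b" is a constant column

# Drop zero-variance columns before standardizing, so they cannot become NaN.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)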

vijayakumar-govindarajulu commented 5 years ago

I was facing the same problem. I thought I had my inputs covered and carried out a few suggestions mentioned here, but to no avail. When I inspected my inputs again, I had one value that was NaN. When I took care of it, it worked.

ZhuoyaYang commented 5 years ago

(Quoting @liminn's comment above about dropping the last, partial batch in the data_generator.)

Thank you so much! The NaN val loss disappeared. But my validation set size is an integer multiple of the batch size, so why is this helpful?

p30arena commented 5 years ago

Thanks to @eng-tsmith. In my case, I had to use fit_generator instead of fit. Then I realized that I was passing only one ground truth for each batch, so I fixed it.

mdalvi commented 4 years ago

All the discussion here talks about NaN in the input data. I found my culprit in the output data and fixed it by removing NaNs from the regression target.

khums commented 4 years ago

All the discussion here talks about NaN in the input data. I found my culprit in the output data and fixed it by removing NaNs from the regression target.

How did you do that inside the keras loss function?

mcourteaux commented 4 years ago

I had the problem that my regularization loss became inf (infinite). This was clear because the actual prediction loss was still a nice float. The reason for me was that my activity regularization, containing a squaring operation, was getting some very large values: larger than the square root of the maximal IEEE floating point value, such that, after squaring it, the result became (aka "was rounded to") infinity.

khums commented 4 years ago

I had the problem that my regularization loss became inf (infinite). This was clear because the actual prediction loss was still a nice float. The reason for me was that my activity regularization, containing a squaring operation, was getting some very large values: larger than the square root of the maximal IEEE floating point value, such that, after squaring it, the result became (aka "was rounded to") infinity.

I had similar issues with my loss function. Moreover, the eigenvalue decomposition inside TensorFlow v1.13.1 has the issue that if any singular value is encountered, the loss function still remains NaN even if you filter NaNs from the values and perform aggregate operations.

Sawatdatta commented 4 years ago

In my case, model.predict returns NaN values. The model is already trained and I have saved the weights. After compiling the model and loading the weights, the prediction returns NaN.

RCTimms commented 4 years ago

I was sometimes taking the log of a very small number somewhere in my cost function. I added a tiny amount of jitter to stop the output becoming -inf and a NaN being produced at the next update.

Hope this might help someone one day!
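
A sketch of that epsilon trick in a custom Keras loss, illustrated here with a hand-rolled binary cross-entropy (TensorFlow backend assumed; this is not the original poster's cost function):

import tensorflow as tf

def safe_log_loss(y_true, y_pred, eps=1e-7):
    # Clipping keeps log() away from log(0) = -inf, which would otherwise
    # turn into NaN at the next gradient update.
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    return -tf.reduce_mean(y_true * tf.math.log(y_pred)
                           + (1.0 - y_true) * tf.math.log(1.0 - y_pred))

# model.compile(loss=safe_log_loss, optimizer='adam')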

raykipa commented 4 years ago

I had the same error for a multiclass problem I was dealing with. At first I had an output layer with only 1 node and it was giving me a loss of NaN, so I changed the output layer to the number of classes I had and it worked!

nishuai commented 4 years ago

I tried all the solutions mentioned above until I finally figured out that I had put in an intermediate backend layer to log-transform a list that could contain 0 values. Hope it helps.

vikasnataraja commented 4 years ago

I tried a lot of alternatives and it looks like the NaN loss can be caused by many different things. In my case, the Keras custom image generator (Keras Sequence) I wrote was the culprit. While generating batches of images, I had initialized the numpy array as np.empty((self.batch_size, self.resize_shape_tuple[0], self.resize_shape_tuple[1], self.num_channels)). I changed that to np.zeros() and it resolved the issue.
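
For context, a short sketch of the difference (the shape values are placeholders):

import numpy as np

batch_size, height, width, channels = 32, 224, 224, 3   # placeholder shapes

# np.empty returns uninitialized memory; any batch slot that is never filled
# keeps garbage values, which can poison the loss. np.zeros starts from 0.0.
batch = np.zeros((batch_size, height, width, channels), dtype='float32')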

Sawatdatta commented 4 years ago

Retraining helped me to solve this problem.


yasersakkaf commented 4 years ago

Thanks @lhatsk, using RMSprop instead of SGD solved my problem

lambdavar commented 4 years ago

(Quoting @Krithi07's comment above about removing NaNs from the input data with df.dropna().)

This is nice, but dropping the whole row because of a NaN value is bad for time-series problems. I found that using df.fillna(0) gets better results!

Ldoun commented 4 years ago

My problem was that my y (target) values were [1, 2] rather than [0, 1]. This made my loss value negative and eventually produced a NaN loss, so check whether your y values are correct.
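
A one-line sketch of that fix (the label array here is just an example):

import numpy as np

y = np.array([1, 2, 1, 2, 2])   # 1-based labels
y = y - y.min()                 # now in {0, 1}, the range Keras losses expect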

SevenUp92 commented 4 years ago

For me the core of the problem was that I used "relu" as the activation function in the LSTM layer. I replaced "relu" with "tanh" and it worked fine.

taesookim0412 commented 3 years ago

I had this problem, and my model would predict NaNs on any data, even though my losses were decreasing normally. This probably means that there was corruption when it was processing the last batch. Therefore, I changed two things: I changed my model from outputting an activation to outputting a sequential layer (in mixed precision), although I don't think this was the cause of the problem, and I also used the drop_remainder=True argument in Dataset.batch(). Now it doesn't mysteriously all go NaN after the first epoch. I'm not sure why this even happened, since it worked just fine with other activation functions.
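
A sketch of the Dataset.batch(drop_remainder=True) part with tf.data (the data here is just placeholder random arrays):

import numpy as np
import tensorflow as tf

X_train = np.random.rand(1000, 32).astype('float32')   # placeholder features
Y_train = np.random.rand(1000, 1).astype('float32')    # placeholder targets

dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
dataset = dataset.shuffle(1024).batch(256, drop_remainder=True)  # no partial final batch
# model.fit(dataset, epochs=40)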

saihtaungkham commented 3 years ago

In my case, the problem was that the number of output neurons in the last layer didn't match the actual label size. This bug caused me quite a headache. :)