Closed indra215 closed 3 years ago
You can't use softmax with only a single output unit.
sry it's commented...there is no softmax layer at the output..i've updated the question
Have you tried reducing the batch size?
I sometimes get loss: nan with my LSTM networks for time-series regression, and I can nearly always avoid it either by reducing the sizes of my layers, or by reducing the batch size.
Have you tried using e.g., rmsprop instead of sgd? Usually worked better for me with regression.
A few comments. Your l2 regularizers are using pretty large terms; try something much smaller to start, e.g. l2(0.001), or get rid of them altogether to see if that helps. You may be driving your weights to 0 too fast. Your dropout rates are also pretty high; generally people don't go above 0.5. Also, for regression problems, I find larger batch sizes to be more useful, ~500.
I wanted to point this out so that it's archived for others who may experience this problem in future. I was running into my loss function suddenly returning a nan after it got some way into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network and the shape of the network. I was still getting a loss that eventually turned into a nan and I was getting quite frustrated.
Then it dawned on me. I may have some bad input. It turns out, one of the images that I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the std deviation and thus I ended up with an exemplar matrix which was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.
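For anyone hitting the same all-zero-image case, here is a minimal sketch of a per-image normalization that guards against a zero standard deviation. The function name `normalize_image` and the `eps` value are assumptions made for illustration, not the commenter's actual code.

```python
import numpy as np

def normalize_image(img, eps=1e-8):
    """Subtract the mean and divide by the std, guarding against std == 0."""
    img = np.asarray(img, dtype=np.float64)
    mean = img.mean()
    std = img.std()
    # A constant image (e.g. all zeros) has std == 0; dividing by it would
    # give 0/0 = NaN, so fall back to eps (hypothetical default) instead.
    return (img - mean) / max(std, eps)

blank = np.zeros((4, 4))
normalized_blank = normalize_image(blank)
print(np.isnan(normalized_blank).any())
```

With this guard a blank image normalizes to all zeros rather than all NaN; whether such an image should be in the training set at all is a separate question.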
Share my experience for benefit of others... One thing I found is the optimizer plays a role in nan loss issue. Changing from rmsprop to adam optimizer makes this problem go away for me when training a LSTM.
I recall I had such problems when I used SGD optimizer too, and also rmsprop. Try adam.
For future reference: NaN loss could come from any value in your dataset that is not float or int. In my case, there were some NumPy infinities (np.inf), resulting from divide by zero in my program that prepares the dataset. Checking for inf or nan data first may save you some time spent trying to find faults in the model.
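A quick way to do that check, sketched with NumPy only. The helper name `report_non_finite` is made up for this example; it assumes the data can be converted to a float array (strings would raise an error, which is itself a useful signal).

```python
import numpy as np

def report_non_finite(name, array):
    """Print and return the number of NaN/inf entries in a numeric array."""
    array = np.asarray(array, dtype=np.float64)
    n_nan = int(np.isnan(array).sum())
    n_inf = int(np.isinf(array).sum())
    print(f"{name}: {n_nan} NaN, {n_inf} inf")
    return n_nan + n_inf

# Toy data: one inf in the features, one NaN in the targets.
X = np.array([[1.0, 2.0], [np.inf, 4.0]])
y = np.array([0.5, np.nan])
bad_X = report_non_finite("X", X)
bad_y = report_non_finite("y", y)
```

Running it on both the inputs and the targets is worthwhile, since either one can poison the loss.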
@ctawong I'm using relu for activation, categorical_crossentropy for loss, and adam for optimization and I'm getting nan for the loss value
Optimizer selection was a major factor in my problem as well (image convolution with unbounded output - gradient explosion with SGD). My experience was that RMSprop with heavy regularization was effective in preventing gradient explosion, but that caused training to converge very slowly (many steps / epochs required).
Adam worked with no dropout / regularization and consequently converged very quickly. Whether no dropout / regularization is a good idea (as it helps prevent over-fitting) is a separate question but at least now I can determine the proper amount.
My recommendations regarding the issue:
df.isnull().any()
I spent literally hours on this problem, going through every possible suggestion. Then discovered that one column in my data set had all the same numerical value, making it effectively a worthless addition to the DNN. I'd recommend anyone to go right back to their data and don't make any assumptions or take anything for granted.
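A constant column like that can be spotted before training. The sketch below is one possible check, assuming a 2-D numeric feature matrix with NaNs already removed; `constant_columns` is a hypothetical helper name.

```python
import numpy as np

def constant_columns(X):
    """Return indices of columns whose values are all identical."""
    X = np.asarray(X)
    # A column is constant exactly when its max equals its min.
    return np.flatnonzero(X.max(axis=0) == X.min(axis=0))

X = np.array([[1.0, 7.0, 0.2],
              [2.0, 7.0, 0.4],
              [3.0, 7.0, 0.9]])
const_idx = constant_columns(X)
X_reduced = np.delete(X, const_idx, axis=1)  # drop the constant columns
```

Such columns also have zero standard deviation, so standardizing them divides by zero, which is one way they can end up producing NaN.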
I have normalized my input data to [0,1] which solved the error where loss = nan
Changing from sgd to rmsprop solved the problem for me (linear regression problem)
I also had these problems. I tried everything mentioned above and nothing helped.
But now I seem to have found a solution. I am using a fit-generator and in the Keras Documentation (fit_generator) it mentions that:
...different batches may have different sizes. The last batch of the epoch is commonly smaller than the others...
I still changed my generator to only output batches the right size. And voila, since then I dont get NaN and inf anymore.
Not sure if this helps everybody but I still want to post what helped me.
I tried every suggestion on this page and many others to no avail. We were importing csv files with pandas, then using keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing some CSV files led to nan while others worked, we looked at the encoding of the files and realized that ascii files were NOT working with keras, leading to nan loss and accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.
If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. Haven't tried the latter but I'd imagine it would work as well. Hopefully this helps someone very very frustrated!
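If you would rather do the conversion in Python than with a command-line tool, something along these lines should work. It is only a sketch: `reencode_to_utf8` is a made-up helper, and it assumes you already know the source encoding (ISO-8859-1 is used as the default here purely as an example).

```python
import os
import tempfile

def reencode_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Read a text file in src_encoding and write it back out as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

# Small self-contained demo using a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "latin1.csv")
dst_path = os.path.join(tmp_dir, "utf8.csv")
with open(src_path, "wb") as f:
    f.write("id,text\n1,café\n".encode("iso-8859-1"))

reencode_to_utf8(src_path, dst_path)
with open(dst_path, "rb") as f:
    converted = f.read()
```

Passing the wrong `src_encoding` will silently produce mojibake rather than an error for ISO-8859-1, so it is worth checking the result by eye.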
I had the loss = nan issue and I solved it by making sure the number of classes in the config and in my dataset are the same. The default number of classes was 92+1.
Hi guys. I've run into a weird problem.
Training: 10000 images, Validation: 2000 images, nb_classes: 8
Example 1:
base_model = densenet121(weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(8, activation='sigmoid')(x)
model = Model(input=base_model.input, output=predictions)
for layer in base_model.layers:
    layer.trainable = False
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, validation_data=(X_val, Y_val))
When I run this code, the train loss and val loss are NaN. Then I changed the network.
Example 2:
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), padding='same', input_shape=(channels, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('sigmoid'))
model.compile(loss=multitask_loss, optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(X_val, Y_val))
But when I run this code, the train loss and val loss are normal. So I think the problem is in the network when fine-tuning it. I want to use the pre-trained DenseNet now; how can I solve the NaN loss? Thanks.
I was getting the loss as nan in the very first epoch, as soon as the training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna())
I hope this helps someone encountering similar problem
Hi
I put model.add(BatchNormalization()) after conv layers and works for me
In my case, it is the loss function. I used loss='sparse_categorical_crossentropy' and switched to loss=losses.mean_squared_error (from keras import losses). loss got normal.
I solved my "loss: nan" problem by fixing my annotations. I used a conversion script for annotations that changed some bounding boxes sizes to 0 width or height erroneously.
The problem can happen for several reasons; mine was the second one: 1) The existence of some NaN or null elements in the dataset. 2) A mismatch between the number of classes and the corresponding labels.
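The second cause can be caught with a small sanity check before training. This is a sketch assuming integer class labels that are expected to lie in 0..num_classes-1; `check_labels` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def check_labels(labels, num_classes):
    """Raise ValueError if any integer label falls outside [0, num_classes)."""
    labels = np.asarray(labels)
    bad = labels[(labels < 0) | (labels >= num_classes)]
    if bad.size:
        raise ValueError(
            f"labels {np.unique(bad).tolist()} are outside [0, {num_classes})"
        )

check_labels([0, 1, 2, 1], num_classes=3)  # in range, passes silently

try:
    check_labels([0, 1, 3], num_classes=3)  # label 3 needs a 4th class
    label_error = None
except ValueError as exc:
    label_error = str(exc)
```

Here `num_classes` should be the number of units in the model's output layer.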
I experience the same issue and wanted to share that in my case it wasn't one of the features that had the nan/inf value, it was actually an infinite Y value.
Hope that will help someone... FYI.
Want to share my experience: when I tried to add more data to the LSTM network, the loss became NaN after some time. The solution was to set the batch size to 1, which let me find the 'bad' data sample that had some '0' values causing the loss function not to converge. Hope this helps!
Make sure you don't have any NaN (or string) in the dataset. I was having the same problem with regression. If you are using pandas, know that the dataframe.replace(np.nan, some_value) function doesn't modify the dataframe that calls it but returns the modified dataset. Instead, one should do:
new_dataframe = dataframe.replace(np.nan , some_value)
If still the loss is NaN, refer this link.
I faced the same problem using an LSTM. The problem was that my data had some nan values after standardization. Therefore, we should check the model's input data after the standardization to see whether it contains nan values:
print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))
You can solve this by adding a small value (0.000001) to the std, like this:
mean = np.mean(train, axis=0)
std = np.std(train, axis=0)+0.000001
X_train = (train - mean) / std
X_test = (test - mean) /std
I am using model.fit_generator() in keras, and I got the problem that train_loss is normal and decreasing, but val_loss is inf. This was so strange; I checked everything and didn't know why.
Finally, I changed the code in my custom data_generator:
def __len__(self):
    # return int(np.ceil(len(self.names) / float(self.batch_size)))
    return int(np.floor(len(self.names) / float(self.batch_size)))
This means the last batch in data_generator is dropped, and this works!
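To make the effect of floor versus ceil concrete, here is a stripped-down generator that only ever yields full batches. It mimics the `__len__`/`__getitem__` interface of a Keras Sequence but deliberately does not subclass it, so it runs without Keras; the class name `BatchGenerator` and its attributes are assumptions for this sketch.

```python
import numpy as np

class BatchGenerator:
    """Index-based batch generator that skips the trailing partial batch."""

    def __init__(self, X, y, batch_size):
        self.X, self.y, self.batch_size = X, y, batch_size

    def __len__(self):
        # Floor division instead of ceil: the incomplete last batch
        # is never produced.
        return len(self.X) // self.batch_size

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.batch_size
        stop = start + self.batch_size
        return self.X[start:stop], self.y[start:stop]

X = np.arange(10, dtype=np.float32).reshape(10, 1)
y = np.arange(10, dtype=np.float32)
gen = BatchGenerator(X, y, batch_size=4)
batch_shapes = [gen[i][0].shape for i in range(len(gen))]
```

With 10 samples and a batch size of 4, the two leftover samples are simply never served, so every batch has the same shape.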
In my case, the generator function that I was using was not handling the batches properly: instead of returning data as (batchsize, items, height, width, channel) it was returning (0, items, height, width, channel). So I had to fix the logic of my generator to return data properly.
I had a categorical field where the column was only 0. My norm function made this column None and raised this exception. Solution was to remove the column since there is only one category.
I was facing the same problem. I thought I had my inputs covered and carried out few suggestions mentioned here, but to no avail. When I inspected my inputs again, I had one value as NaN. When I took care of it, it worked.
Thank you so much! After dropping the last batch, the NaN in my val loss disappeared. But my validation set size is an integer multiple of the batch size. Why is this helpful?
Thanks to @eng-tsmith In my case, I had to use "fit_generator" instead of "fit", Then I realized that I was passing only one ground truth for each batch, so I fixed it.
All discussions talk about NaN in input data. I found my culprit in the output data. Fixed it by removing NaN from the regression target output.
How did you do that inside the keras loss function?
I had the problem that my regularization loss became inf (infinite). This was clear because the actual prediction loss was still a nice float. The reason for me was that my activity regularization, containing a squaring operation, was getting some very large values: larger than the square root of the maximal IEEE floating point value, such that, after squaring it, the result became (aka "was rounded to") infinity.
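The overflow described here is easy to reproduce in NumPy. The sketch below assumes float32, the usual default dtype in Keras; the sample activation values are made up.

```python
import numpy as np

f32_max = np.finfo(np.float32).max        # largest finite float32
threshold = np.sqrt(np.float64(f32_max))  # squaring anything above this overflows

activations = np.array([1e10, 1e19, 1e20], dtype=np.float32)
with np.errstate(over="ignore"):          # silence the overflow warning for the demo
    squared = np.square(activations)

print(threshold)
print(np.isinf(squared))
```

Only the last value lies above the threshold, so only its square becomes inf. Clipping activations before squaring, or computing the penalty in float64, are two possible ways around this, though neither is confirmed as what the commenter did.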
I had similar issues with my loss function. Moreover, the eigenvalue decomposition in tensorflow v1.13.1 has the issue that if any singular value is encountered, the loss function still remains NaN even if you filter NaN from the values and perform any aggregate operations.
In my case model.predict returns nan values. The model is already trained and I have saved the weights. After compiling the model and loading the weights, the prediction returns nan.
I was sometimes taking the log of a very small number somewhere in my cost function. I added a tiny amount of jitter to stop the output becoming -inf and a NaN being produced at the next update.
Hope this might help someone one day!
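A minimal sketch of that kind of guard, shown with NumPy rather than inside a Keras loss. Note that it clips the argument to a small floor instead of adding jitter, which is a closely related trick; `safe_log` and the `eps` value are assumptions for illustration.

```python
import numpy as np

def safe_log(p, eps=1e-7):
    """log(p) with p clipped to [eps, inf) so that log(0) = -inf cannot occur."""
    return np.log(np.clip(p, eps, None))

probs = np.array([0.0, 1e-30, 0.5, 1.0])
with np.errstate(divide="ignore"):
    naive = np.log(probs)      # first entry is -inf
guarded = safe_log(probs)      # all entries are finite
```

The same idea applies inside a custom loss: clip or offset whatever goes into the log before taking it.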
I had the same error for a multiclass problem I was dealing with. At first I had my output layer with only 1 node and it was giving me a loss of nan, so I changed the output layer to the number of classes I had and it worked!!!
I tried all the solutions mentioned above until I finally figured out that I put an intermediate backend layer to log transform a list that possibly contains 0 values, hope it helps.
I tried a lot of alternatives and it looks like the NaN loss can be caused by many different things. In my case, the Keras custom image generator (Keras Sequence) I wrote was the culprit. While generating batches of images, I had initialized the numpy array as np.empty((self.batch_size, self.resize_shape_tuple[0], self.resize_shape_tuple[1], self.num_channels)). I changed that to np.zeros() and it resolved the issue.
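Why this matters: np.empty does not initialize memory, so any slot of the batch that is never written (for example when fewer images than batch_size are available) contains arbitrary bytes, which may decode to huge values or NaN. A rough sketch of the safer pattern, with a hypothetical `make_batch` helper:

```python
import numpy as np

def make_batch(images, batch_size, height, width, channels):
    """Copy up to batch_size images into a zero-initialized batch array."""
    # np.zeros guarantees that slots which are never written hold 0.0;
    # np.empty would leave whatever bytes happened to be in memory there.
    batch = np.zeros((batch_size, height, width, channels), dtype=np.float32)
    for i, img in enumerate(images[:batch_size]):
        batch[i] = img
    return batch

# Only 3 images for a batch of 4: the last slot stays at zero.
images = [np.ones((2, 2, 3), dtype=np.float32) for _ in range(3)]
batch = make_batch(images, batch_size=4, height=2, width=2, channels=3)
```

Zero padding still feeds fake samples to the network, so returning a smaller batch or dropping the partial one may be preferable depending on the setup.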
Retraining helped me to solve this problem.
Thanks @lhatsk, using RMSprop instead of SGD solved my problem
Removing NaNs with df.dropna() is nice, but dropping the whole row because of a nan value is bad for time-series problems. I found that using df.fillna(0) gets better results!
My problem was that my y (target) values were [1,2], not [0,1]. This made my loss value go negative and eventually it became nan, so check that your y values are correct.
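If the labels are just shifted or otherwise not zero-based, one way to fix them is to remap them with np.unique. This is a sketch for integer class labels; `to_zero_based` is a made-up name.

```python
import numpy as np

def to_zero_based(labels):
    """Map arbitrary integer labels onto 0..K-1, preserving their sort order."""
    # classes[i] is the original label that now corresponds to index i.
    classes, remapped = np.unique(labels, return_inverse=True)
    return classes, remapped

classes, y_fixed = to_zero_based(np.array([1, 2, 2, 1, 2]))
```

Keeping `classes` around lets you translate predictions back to the original label values afterwards.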
For me the core of the problem was, that I used "relu" as activation function in the LSTM-layer. I replaced "relu" with "tanh" and it worked fine.
I had this problem, and my model would predict "NaNs" on any data, even though my losses were decreasing normally. This probably means that there was corruption when it was processing the last batch. Therefore, I changed two things, I changed my model from outputting an activation to outputting a sequential layer (in mixed precision), although I don't think this was the cause of the problem. I also used the drop_remainder=True argument in Dataset.batch(). Now it doesn't mysteriously all go NaN after the first epoch. I'm not sure why this even happened, since it worked just fine with other activation functions.
In my case, the problem was that the number of neurons in the last output layer didn't match the actual label size. This bug caused me quite a headache. :)
I'm running a regression model on patches of size 32x32 extracted from images against a real value as the target value. I have 200,000 samples for training but during the first epoch itself, I'm encountering a nan loss. Can anyone help me solve this problem please ? I've tried on both GPU and CPU but the issue still appears.
model = Sequential()
model.add(Convolution2D(50, 7, 7, border_mode='valid', input_shape=(1, 32, 32)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(Dense(800, W_regularizer=l2(0.5)))
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(Dense(1))
sgd = SGD(lr=0.00001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=100)
model.compile(loss='mean_squared_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=256, nb_epoch=40)