keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

model.predict() gives same output for all inputs #6447

Closed schmolze closed 5 years ago

schmolze commented 7 years ago

I'm trying to learn a regression problem. The data consists mostly of one-hot encoded categorical variables, plus one continuous variable. The target output is a probability (0-1). Here is the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

def read_lines(filename):
    lines = []

    with open(filename) as file:
        for line in file:
            line = line.strip()
            lines.append(line)

    return lines

# read target survival probabilities and patient IDs
targets = np.loadtxt("../survival/target_probs.txt", delimiter=",")

all_patient_ids = read_lines("../survival/target_patient_ids.txt")

# read available patient IDs and variable names
patient_ids = read_lines("clinical_patient_ids.txt")
var_names = read_lines("clinical_var_names.txt")

# only use available cases
pt_idxs = [all_patient_ids.index(x) for x in patient_ids]

targets = targets[pt_idxs]

# determine number of cases for 60/10/30 train/val/test split
n_cases = len(patient_ids)
n_train = int(round(n_cases*.6))
n_val = int(round(n_cases*.1))
n_test = int(round(n_cases*.3))

# extract training, val, and test patient IDs
train_patient_ids = patient_ids[:n_train]
val_patient_ids = patient_ids[n_train:n_train+n_val]
test_patient_ids = patient_ids[n_train+n_val:]

Y_train = targets[:n_train]
Y_val = targets[n_train:n_train+n_val]
Y_test = targets[n_train+n_val:n_cases]

# load data
data = np.loadtxt("clinical_data.txt", delimiter=",")

# preprocess
min_max_scaler = preprocessing.MinMaxScaler()

min_max_scaler.fit(data)

data = min_max_scaler.transform(data)

# set up  model architecture
model = Sequential()

model.add(Dense(32, activation="relu", input_dim=len(var_names),
    kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.2))
model.add(Dense(20, activation="relu", kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.2))
model.add(Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.2))
model.add(Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.2))
model.add(Dense(1, activation="sigmoid", kernel_regularizer=regularizers.l2(0.01)))

X_train = data[:n_train]
X_val = data[n_train:n_train+n_val]
X_test = data[n_train+n_val:n_cases]

# train on clinical data
early_stop = EarlyStopping(monitor="val_loss", patience=5)

reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.2,
    patience=2, min_lr=0.001)

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)

# Note: the string "sgd" is passed here, so the custom SGD instance defined
# above (lr=0.1, momentum=0.9) is never used -- Keras falls back to the
# default SGD settings.
model.compile(loss="mse", optimizer="sgd", metrics=["mse"])

hist = model.fit(X_train, Y_train, validation_data=(X_val, Y_val), 
    epochs=1000, batch_size=32, callbacks=[reduce_lr, early_stop])

preds = model.predict(X_test)

evals = model.evaluate(X_test, Y_test)

print(evals)
print(preds[0:10])

# summarize history for loss
plt.plot(hist.history["loss"])
plt.plot(hist.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "val"], loc="upper left")
plt.show()

It sure seems to be learning something (see the attached training/validation loss plot):

But print(preds[0:10]) gives:

[[ 0.87765867]
 [ 0.87765765]
 [ 0.87766296]
 [ 0.87765878]
 [ 0.87765783]
 [ 0.87765902]
 [ 0.87765855]
 [ 0.87765938]
 [ 0.87766141]
 [ 0.87766016]]

Even though print(evals) gives a loss and mse of: [0.012566652174742577, 0.0076035054909729212]

It even does that when I call model.predict() on training data.

I've tried no regularization, more regularization, different optimizers, different learning rates, mean/std normalization, less depth, more depth, all with the same result.

Any ideas?

fchollet commented 7 years ago

model.predict() gives same output for all inputs
It sure seems to be learning something:

I would assume that is precisely what the model is learning: to predict the same "optimal" output regardless of the input.

schmolze commented 7 years ago

Right, I figure it's probably learning the average target value for the training cases, or something similar. It just seems odd that the val/test MSE is then so low as well, but perhaps my data has a similar average for any given batch of cases.
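For anyone who wants to check this quickly, here is a minimal sanity check reusing the variable names from my snippet above:

import numpy as np

# If the model has collapsed to a constant predictor, the prediction spread is
# essentially zero and the constant is usually close to the training-target mean.
preds = model.predict(X_test)
print("prediction std: ", preds.std())
print("prediction mean:", preds.mean())
print("mean of Y_train:", np.mean(Y_train))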

santoshaditham commented 7 years ago

More epochs helped in my case. Not a real solution for @schmolze, but something to try if others have the same issue.

I thought I had the same problem. I am performing multi-label classification. Whether I used the predict() method or the predict_classes() method, I got the same output prediction for every test case. But when I increased the number of epochs from 10 to 250, my model no longer suffered from this problem.

ohernpaul commented 7 years ago

I'm using a many-to-many LSTM for categorical prediction of images that have been sub-sampled (tiled). During training, the acc and val_acc hit 100% and the loss and val_loss decrease to 0.03 over 100 epochs. I use model.predict() on the training and validation sets and get 100% prediction accuracy, but when I feed in a quarantined/shuffled set of tiled images I get 33% prediction accuracy every time.

Even after shuffling and making another prediction, the outputs are exactly the same (same sequence of classes predicted). Not sure what to do.

from keras.models import Sequential
from keras.layers import LSTM, Flatten, Dense, Activation
from keras.optimizers import Adam

opt = Adam(lr=.00001)

model = Sequential()
model.add(LSTM(200, input_shape=(100, 3600), return_sequences=True,
               init='he_normal', inner_init='he_normal'))
model.add(Flatten())
model.add(Dense(6))  # output size, each number being a future value. 1 is what we want
model.add(Activation('sigmoid'))
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# Note: validation_data and validation_split are both passed here;
# when validation_data is given, Keras ignores validation_split.
model.fit(trainX,
          trainY,
          validation_data=(testX, testY),
          validation_split=0.8,
          nb_epoch=250,
          shuffle=True,
          batch_size=402,
          verbose=1)

ohernpaul commented 7 years ago

Update:

I compared the prediction using 250 epochs to a 15-epoch prediction. The predictions are different, but the accuracy is still the same 33%.

pablo14 commented 6 years ago

@schmolze if it helps, I started to fix this by adding validation_split=0.4.

saketkarve commented 6 years ago

I am facing a similar issue as well.

I am using a sequence-to-sequence model to extract key-phrases from a text document. I have trained the stacked LSTM model (for encoding) on 15 text documents (I have a dataset of 450 documents, but because of RAM limitations we are using a very small subset for validation purposes).

The following model is used to train the encoder and the decoder:

from keras.models import Model
from keras.layers import Input, LSTM, Dense

#ENCODER
encoder_inputs = Input(shape=(tx, f))
encoder1 = LSTM(lstm_ip[0], return_sequences = True)(encoder_inputs)
encoder2 = LSTM(lstm_ip[1], return_sequences = True)(encoder1)
encoder3 = LSTM(lstm_ip[2], return_state = True)
encoder_outputs, state_h, state_c = encoder3(encoder2)
encoder_states = [state_h, state_c]

#DECODER
decoder_inputs = Input(shape=(None, vs_op))
decoder_lstm = LSTM(lstm_ip[2], return_sequences = True, return_state = True)
decoder_outputs,_,_ = decoder_lstm(decoder_inputs, initial_state = encoder_states)
decoder_dense = Dense(vs_op, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

For testing (inference), the following models are used:

#ENCODER
encoder_model = Model(encoder_inputs, encoder_states)

#DECODER
decoder_state_input_h = Input(shape=(lstm_ip[2],))
decoder_state_input_c = Input(shape=(lstm_ip[2],))

decoder_state_input = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state = decoder_state_input)

decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_state_input, [decoder_outputs] + decoder_states)

This is how I have compiled and trained the model:

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_output_data,
          batch_size=8,
          epochs=50)

model.predict() seems to give the same output irrespective of the input; we are getting exactly the same results for different inputs. I am calling predict() on the same data the model was trained on. Could the problem be that the data we used for training is too small? Or is the number of epochs too low?

aarif96 commented 6 years ago

model.predict() seems to be giving me the same output irrespective of the input. However, model.predict_on_batch() seems to be giving me proper results. Is there any reason for this?

labrax commented 6 years ago

Hi @aarif96 , have you found any solutions on this?

varunranga commented 6 years ago

Try reducing the batch size. You may have a smaller dataset for the given problem.

labrax commented 6 years ago

Hi @varunranga, I do have a small dataset; however, I am using a generator, which should help. I tried reducing the batch size to no effect. Do you have any more ideas?

tragu commented 6 years ago

@saketkarve I am also facing the same problem. What should I do?

labrax commented 6 years ago

@tragu In my case I reduced the number of output classes, increased the number of samples for each class, and fixed inconsistencies using a pre-trained model.

chaoqing commented 6 years ago

I also encountered this kind of problem before; it turned out that there was one NaN value in my dataset. Another fix might be to scale your dataset first.
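As a rough sketch (assuming NumPy arrays named X_train/X_val/X_test), the check and the scaling could look like this:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fail fast if the data contains NaN or Inf values.
assert np.isfinite(X_train).all(), "training data contains NaN/Inf"

# Scale features to zero mean / unit variance; fit on the training split only
# so no statistics leak from the validation/test sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)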

aarif96 commented 6 years ago

@labrax No, I haven't found any solution for this.

AsimFayyazRaja commented 6 years ago

I got this issue with a dense model in Keras; it was solved by using more neurons, more layers, and more dropout. Also, lowering the learning rate almost always helps.

ambader commented 6 years ago

I also had this problem. I'm new to the topic, so I relied on snippets from samples I found. All of them had an activation function on the input layer, as yours does. I'm far from being an expert, but that's different from all the theory I've read about neural networks. When I removed the activation from the input layer, my model finally produced varying outputs. If that only works because it produces a faulty architecture, someone with a more solid background may be able to explain why.
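To make it concrete, this is roughly the change I mean (just a sketch; n_features stands in for the actual input size, not the original architecture):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, input_dim=n_features))  # no activation on the first layer (linear)
model.add(Dense(16, activation="relu"))
model.add(Dense(1, activation="sigmoid"))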

TAUFEEQ1 commented 6 years ago

Same problem in my network; so far I am trying to run it with fewer fully connected layers after the ResNet50.

TAUFEEQ1 commented 6 years ago

Same problem in my network; so far I am trying to run it with fewer fully connected layers after the ResNet50.

This solution has worked!

wt-huang commented 5 years ago

Closing as this is resolved

hasan-kamal commented 5 years ago

I also encountered this kind of problem before; it turned out that there was one NaN value in my dataset. Another fix might be to scale your dataset first.

Scaling the dataset fixed this for me.

mishaAscend commented 5 years ago

I also encountered this kind of problem before; it turned out that there was one NaN value in my dataset. Another fix might be to scale your dataset first.

Scaling helped me. Thanks!

e2718281 commented 5 years ago

Scaling solved this problem for me as well.

ShubhraDeshpande commented 5 years ago

My model is giving almost 70% accuracy, even for validation. There is no NaN value in the dataset, and it predicts the exact same output for any data. I tried on test and train data as well. Scaling didn't help. I even normalized the data. What can be done?


TAUFEEQ1 commented 5 years ago

Consider that this might actually be the best accuracy on your data set.

One thing you could try is augmenting your data; more data helps. Another is to reduce your learning rate, which helps the model learn more. If you don't reduce the learning rate, reduce the batch size instead.

Lastly, shuffling the training data during training can also help.
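Roughly something like this, as a sketch only (the exact values depend on your data):

from keras.optimizers import SGD

# Lower learning rate, smaller batches, and explicit shuffling each epoch.
model.compile(loss="categorical_crossentropy",
              optimizer=SGD(lr=1e-4, momentum=0.9),
              metrics=["accuracy"])

model.fit(X_train, Y_train,
          validation_data=(X_val, Y_val),
          batch_size=8,
          epochs=100,
          shuffle=True)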

ShubhraDeshpande commented 5 years ago

Consider that this might actually be the best accuracy on your data set. One thing you could try is augmenting your data; more data helps. Another is to reduce your learning rate, which helps the model learn more. If you don't reduce the learning rate, reduce the batch size instead. Lastly, shuffling the training data during training can also help.

Thank you. Reducing the batch size and augmenting the data actually solved the issue. It perhaps wasn't the best accuracy, but I think my notebook had been autosaved at a certain point; restarting the whole notebook worked. Thanks.

gopalnitp commented 5 years ago

In my case, reducing the batch size does not change the output (same problem as in #6447). Is there anything else I can do?

shaneh1 commented 5 years ago

After banging my head against the wall for over an hour with this same problem, and having tried all of the suggestions here without success, I eventually found the issue in my training data. I had an error in my normalization function and as a result, I had negative values in the training set. If none of the above suggestions work, this might be worth looking at. Cheers, Shane.

Astromsoc commented 5 years ago

Hi, I'm sorry to bring this problem up again, but my attempts at all the solutions above did not produce the desired effect in the end.

I'm currently working with audio inputs and plan to output a single number from my regression network. However, after scaling the inputs, reducing the batch size (even to 1), augmenting the data, re-sampling the data to be more evenly distributed, and even reducing the model to only one hidden layer, I monitored the variance of the predicted results and found that it shrank to zero halfway through the first epoch and never came back to positive values (which indicates that the model generates only one prediction again).

I also tried reducing the size of the training data to let it overfit, but the outcome was eerie. I have also attempted to include a penalty for predicting the same single value in the loss function, but even though sometimes, by luck, there is non-zero variance in the predicted values, the error (more specifically for my regression model, the MSE) gets increasingly lower through the epochs.

Could there be any further solutions to such an issue? Thank you so much for your attention and help in advance.

TAUFEEQ1 commented 5 years ago

Just suggesting a couple of pointers.

  1. Try to make sure the data is balanced; by that I mean, make sure all classes are well represented.
  2. Consider using RNNs if you aren't; choose or use models with LSTMs.
  3. Use a shallower network first and use the deeper ones later.
  4. Perhaps more importantly, make sure you are using the right activation functions.

Astromsoc commented 5 years ago

Just suggesting a couple of pointers: 1. Try to make sure the data is balanced; by that I mean, make sure all classes are well represented. 2. Consider using RNNs if you aren't; choose or use models with LSTMs. 3. Use a shallower network first and use the deeper ones later. 4. Perhaps more importantly, make sure you are using the right activation functions.

Thank you so much for your super timely and effective answer!

I'm not quite sure if I understand your points right, but I actually tried some variants of them before, and here are a few responses to your kind suggestions:

  1. Since the data I have at hand is for a regression model as well, it does not fall into specific classes but only has different values. And as you correctly said, it is not quite evenly distributed (most values are larger than the rest). Following my understanding and earlier attempts, I tried to "re-sample" the data by using the smaller values more frequently in training. Nonetheless, the only difference is that the predicted number changes in value (just like the different mean values we obtain by applying different weights to the data). I think the problem is more likely within my model.

  2. The current model I'm using is composed entirely of BLSTM layers (and, of course, a dense layer at the end). I referred to the excellent paper by Alex Graves (Framewise phoneme classification with bidirectional LSTM and other neural network architectures), so I also revised the model so that the final Dense(1) layer is preceded by a TimeDistributed Dense layer and then a Lambda layer that takes the mean values. However, there is no improvement at all.

  3. Simplifying the model is currently the major breakthrough I am trying to make here, but even a single BLSTM of 8 units failed to give me encouraging outcomes. I'm still working to figure out what led me to this bottleneck!

  4. The activation I'm using after each BLSTM layer is the rectified one (the default). I did add a clipped ReLU after the final Dense layer to ensure the output does not exceed the range it is supposed to stay in. Following your suggestion, I removed it, but within the first few epochs I monitored, the predictions converge to the mean value as early as epoch 2, so the clipped ReLU is not what is constraining the model.

To specify my model more clearly: I am using batch training, but because of the varied lengths of the audio inputs, I pad zeros onto the end of every audio clip except the longest one in each batch. I'm wondering if the padded zeros are causing this mess.
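In case the padding is indeed the culprit, the next thing I plan to try is a Masking layer so the padded frames are skipped by the BLSTM; a rough sketch (max_len and n_feats are placeholders for my padded length and per-frame feature count):

from keras.models import Sequential
from keras.layers import Masking, Bidirectional, LSTM, Dense

model = Sequential()
# Masking tells downstream layers to ignore timesteps that are all zeros.
model.add(Masking(mask_value=0.0, input_shape=(max_len, n_feats)))
model.add(Bidirectional(LSTM(8)))
model.add(Dense(1))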

Thank you so much again for your warm replies! They are all helpful and I'm still working to implement them fully in depth.

jomayer commented 5 years ago

I had a similar issue; the problem on my end was that I had too many neurons per layer, causing over-saturation of the model. I reduced the number of neurons by a large amount and that seemed to fix it on my end.

duygusar commented 5 years ago

Same problem here!

I am training a small network and the training seems to go fine: the val loss decreases, I reach a validation accuracy of around 80%, and it actually stops training once there is no more improvement (patience=10). It trained for 40 epochs. However, it keeps predicting only one class for every test image!

I tried initializing the conv layers randomly, I added regularizers, I switched from Adam to SGD, I added clipvalue, I added dropouts, and I downsized the network. I also switched to softmax (I have only two labels, but I saw some recommendations to use softmax and a Dense layer with 2 neurons). Some or one of these helped with the overfitting, but nothing worked for the prediction problem.

The data is very balanced (for training I have 32354 / 31681 samples per class; for validation I have 9092 / 9860), so it doesn't make sense that it reaches 80% if it predicts the same label for the evaluation set as well.

What is wrong with my model and how can I fix it? Is it the model, is it a bug or am I doing something wrong with predictions? Any comments are welcome.

#Import some packages to use
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.image import ImageDataGenerator
import os
from keras.regularizers import l2
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.initializers import RandomNormal

os.environ["CUDA_VISIBLE_DEVICES"]="0"

epochs = 200
callbacks = []
#schedule = None
decay = 0.0

earlyStopping = EarlyStopping(monitor='val_loss', patience=10, verbose=0, mode='min')
mcp_save = ModelCheckpoint('.mdl_wts.hdf5', save_best_only=True, monitor='val_loss', mode='min')
reduce_lr_loss = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1, epsilon=1e-5, mode='min')

train_dir = '/home/d/Desktop/s/data/train'
eval_dir = '/home/d/Desktop/s/data/eval'
test_dir = '/home/d/Desktop/s/data/test'

# create a data generator
train_datagen = ImageDataGenerator(rescale=1./255,   #Scale the image between 0 and 1
                                    rotation_range=40,
                                    width_shift_range=0.2,
                                    height_shift_range=0.2,
                                    shear_range=0.2,
                                    zoom_range=0.2,
                                    horizontal_flip=True,)

val_datagen = ImageDataGenerator(rescale=1./255)  #We do not augment validation data. we only perform rescale

test_datagen = ImageDataGenerator(rescale=1./255)  #We do not augment validation data. we only perform rescale

# load and iterate training dataset
train_generator = train_datagen.flow_from_directory(train_dir,  target_size=(224,224), class_mode='categorical', batch_size=16, shuffle=True, seed=42)
# load and iterate validation dataset
val_generator = val_datagen.flow_from_directory(eval_dir,  target_size=(224,224), class_mode='categorical', batch_size=16, shuffle=True, seed=42)
# load and iterate test dataset
# NOTE: shuffle expects a boolean; the string 'False' used originally is truthy,
# so the test set would actually be shuffled and predictions would no longer
# line up with test_generator.filenames.
test_generator = test_datagen.flow_from_directory(test_dir,  target_size=(224,224), class_mode=None, batch_size=1, shuffle=False, seed=42)
#Note: batch size is usually a power of 2 (4, 8, 16, 32, 64, ...)
#batch_size = 4

#from keras import layers
from keras import models
from keras import optimizers
#from keras.layers import Dropout
#from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing.image import img_to_array, load_img

model = models.Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', name='block1_conv1', kernel_initializer=RandomNormal(
        mean=0.0, stddev=0.05), bias_initializer=RandomNormal(mean=0.0, stddev=0.05), input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu', name='block1_conv2', kernel_initializer=RandomNormal(
        mean=0.0, stddev=0.05), bias_initializer=RandomNormal(mean=0.0, stddev=0.05)))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(128, (3, 3), activation='relu', name='block2_conv1', kernel_initializer=RandomNormal(
        mean=0.0, stddev=0.05), bias_initializer=RandomNormal(mean=0.0, stddev=0.05)))
model.add(Conv2D(128, (3, 3), activation='relu', name='block2_conv2',kernel_initializer=RandomNormal(
        mean=0.0, stddev=0.05), bias_initializer=RandomNormal(mean=0.0, stddev=0.05)))
model.add(MaxPooling2D((2, 2), name='block2_pool'))
model.add(Dropout(0.2))
model.add(Conv2D(256, (3, 3), activation='relu', name='block3_conv1', kernel_initializer=RandomNormal(
        mean=0.0, stddev=0.05), bias_initializer=RandomNormal(mean=0.0, stddev=0.05)))
model.add(Conv2D(256, (3, 3), activation='relu', name='block3_conv2', kernel_initializer=RandomNormal(
        mean=0.0, stddev=0.05), bias_initializer=RandomNormal(mean=0.0, stddev=0.05)))
model.add(Conv2D(256, (3, 3), activation='relu', name='block3_conv3', kernel_initializer=RandomNormal(
        mean=0.0, stddev=0.05), bias_initializer=RandomNormal(mean=0.0, stddev=0.05)))
model.add(MaxPooling2D((2, 2), name='block3_pool'))
model.add(Dropout(0.2))
#model.add(layers.Conv2D(512, (3, 3), activation='relu', name='block4_conv1'))
#model.add(layers.Conv2D(512, (3, 3), activation='relu', name='block4_conv2'))
#model.add(layers.Conv2D(512, (3, 3), activation='relu', name='block4_conv3'))
#model.add(layers.MaxPooling2D((2, 2), name='block4_pool'))
model.add(Flatten())
model.add(Dense(256, kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01), activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(0.5))
model.add(Dense(2, kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01), activation='softmax'))

#Lets see our model
model.summary()

#We'll use the RMSprop optimizer with a learning rate of 0.0001
#We'll use binary_crossentropy loss because its a binary classification
#model.compile(loss='binary_crossentropy', optimizer=optimizers.SGD(lr=1e-5, momentum=0.9), metrics=['acc'])
model.compile(loss='categorical_crossentropy',
                   #optimizer=optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=1e-08, decay=decay),
                    optimizer=optimizers.SGD(lr= 0.0001, clipvalue = 0.5, decay=1e-6, momentum=0.9, nesterov=True),
              metrics=['accuracy'])

#The training part
#We train for 64 epochs with about 100 steps per epoch
history = model.fit_generator(train_generator,
                              steps_per_epoch=train_generator.n // train_generator.batch_size,
                              epochs=epochs,
                              validation_data=val_generator,
                              validation_steps=val_generator.n // val_generator.batch_size,
                              callbacks=[earlyStopping, mcp_save]) #, reduce_lr_loss])

#Save the model
model.save_weights('/home/d/Desktop/s/categorical_weights.h5')
model.save('/home/d/Desktop/s/categorical_model_keras.h5')

#lets plot the train and val curve
#get the details form the history object
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

#Train and validation accuracy
plt.plot(epochs, acc, 'b', label='Training accuracy')
plt.plot(epochs, val_acc, 'r', label='Validation accuracy')
plt.title('Training and Validation accuracy')
plt.legend()

plt.figure()
#Train and validation loss
plt.plot(epochs, loss, 'b', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and Validation loss')
plt.legend()

plt.show()

model.evaluate_generator(generator=val_generator, steps=val_generator.n // val_generator.batch_size)

STEP_SIZE_TEST=test_generator.n//test_generator.batch_size
test_generator.reset()
pred=model.predict_generator(test_generator,
steps=STEP_SIZE_TEST,
verbose=1)

predicted_class_indices=np.argmax(pred,axis=1)

labels = (train_generator.class_indices)
np.save('/home/d/Desktop/s/classes', labels)

labels = dict((v,k) for k,v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]

filenames=test_generator.filenames
results=pd.DataFrame({"Filename":filenames,
                      "Predictions":predictions})
results.to_csv("categorical_results.csv",index=False)
YipingNUS commented 4 years ago

I faced the same issue. I have a medium-sized model (400k-parameter bi-GRU with attention). The model fluctuates at a very low accuracy during the first 5 epochs (epoch size: 25k text docs) and predicts the same value regardless of the input. From the sixth epoch on, it magically starts to learn, reaches a good accuracy, and no longer predicts the same value. So I guess I just have to be more patient.

DecaiJin commented 4 years ago

I also have the same problem, but I solved it by decreasing the batch_size (batch_size = 1) and simplifying the CNN structure.

TAUFEEQ1 commented 4 years ago

Generally, the more complex the network, the more weights the network has to tune, so many of them become dead in the process, hence the same output.

Mariobahaa commented 4 years ago

After banging my head against the wall for over an hour with this same problem, and having tried all of the suggestions here without success, I eventually found the issue in my training data. I had an error in my normalization function and as a result, I had negative values in the training set. If none of the above suggestions work, this might be worth looking at. Cheers, Shane.

I had the same problem, and when I saw this I printed a CSV of my data and found NaN values. After fixing my data, it worked properly. Thank you.

austinlostinboston commented 4 years ago

For me, changing the activation function for my hidden layers from tanh to ReLU solved the problem. My guess is my issue was caused by some form of vanishing gradients.

MkUtkarsh commented 3 years ago

For me, changing the activation function for my hidden layers from tanh to ReLU solved the problem. My guess is my issue was caused by some form of vanishing gradients.

Yep, this worked for me too.

gangsteryoda commented 3 years ago

I suspect it means there is likely no signal in your covariates (could be a mistake in your data prep) for it to use so it just defaults to optimizing one single output that minimizes error as much as possible.

elanstop commented 3 years ago

In my case I mistakenly had many duplicated inputs in the training set whose labels did not match each other.

B0-B commented 3 years ago

I once had a similar problem, and the key was stochastic gradient descent (single batch) and a higher learning rate. The higher learning rate makes the loss somewhat more volatile but helps avoid getting stuck at a saddle point, i.e. the single "optimal" output for all inputs.
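Roughly what I mean, as a sketch with illustrative values only:

from keras.optimizers import SGD

# True stochastic updates (batch_size=1) with a comparatively high learning rate.
model.compile(loss="mse", optimizer=SGD(lr=0.05))
model.fit(X_train, Y_train, batch_size=1, epochs=50, shuffle=True)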

sOvr9000 commented 2 years ago

I had this issue with a CNN of over a million parameters. After reading most solutions posted here, I found that what worked for me was decreasing learning rate of the Adam optimizer to something below the default value assumed by Keras (0.001).

sOvr9000 commented 2 years ago

Actually, I've just realized that this is characteristic of vanishing gradients. Implement some batch normalization layers to help prevent this, or refer to this quick guide from the data science Stack Exchange: https://datascience.stackexchange.com/questions/72351/how-to-prevent-vanishing-gradient-or-exploding-gradient
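For example, something along these lines (a sketch only; layer sizes are arbitrary and n_features is a placeholder):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
# BatchNormalization between the linear transform and the activation keeps
# pre-activations in a healthier range, which helps against vanishing gradients.
model.add(Dense(64, input_dim=n_features))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dense(64))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dense(1, activation="sigmoid"))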

azraimahadan commented 9 months ago

Actually, I also encountered the same issue while fine-tuning a BERT model. I managed to resolve it by removing the class-weights assignment when setting up training.

sOvr9000 commented 9 months ago

Second time encountering this, but I figured out what's actually happening here. I analyzed the outputs of each layer (logits) to visualize the gradual flattening out of the values. The cause became quite obvious after seeing it. This problem is caused by the ReLU activation function. During training, perhaps with nonzero-centered data, vanishing gradients manifest as weights are pushed in the direction of zero activation. In other words, weight updates cause forward passes to eventually be completely overwritten to zero by ReLU, resulting in all (or the most frequent) inputs leading to an output of some constant vector that is computed by only the last few layers. By this point, all or most gradients are zero.

The easiest and simplest ways to fix this are to include dropout layers (in order to discourage too many weights being pushed toward zero activation) and/or to replace the ReLU activation with Leaky ReLU (to sidestep the entire problem of zero activation and always allow an escape from the smaller gradients if necessary). Both of these methods are single lines of code added to the model composition.

Since this is related to vanishing gradients, it is also worth individually testing whether residual connections can mitigate the problem. And as a best practice, ensure that the entire dataset has a mean of zero and a variance of one. So there are three, or possibly four, good solutions to improving the learning stability and effectiveness of the model.
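A sketch of what both one-line fixes look like together (sizes and rates are arbitrary placeholders, as is n_features):

from keras.models import Sequential
from keras.layers import Dense, Dropout, LeakyReLU

model = Sequential()
# LeakyReLU (negative slope 0.1) instead of ReLU, so units with negative
# pre-activations still pass a small gradient; Dropout discourages weights
# from collectively drifting toward zero activation.
model.add(Dense(128, input_dim=n_features))
model.add(LeakyReLU(0.1))
model.add(Dropout(0.3))
model.add(Dense(128))
model.add(LeakyReLU(0.1))
model.add(Dropout(0.3))
model.add(Dense(1, activation="sigmoid"))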

raidgpt commented 5 months ago

This is a weird and interesting problem. After one epoch, the model gives one number for all inputs. In the next epoch, the number is different from the previous one, but again the same for all inputs. So maybe it's not "vanishing gradients"; maybe something deeper. Still researching.

sOvr9000 commented 5 months ago

This is a weird and interesting problem. After one epoch, the model gives one number for all inputs. In the next epoch, the number is different from the previous one, but again the same for all inputs. So maybe it's not "vanishing gradients"; maybe something deeper. Still researching.

I learned what it is. If you print the weights and biases, some will have an extremely small norm, usually following or preceding others with large norms. They zero out the forward propagation and thus the output results in the same numbers for any input. That said, this is some form of a vanishing gradient problem since the gradients will become very small before the layer of small norm. You can prevent this by clipping norms: model.compile(optimizer=Adam(global_clipnorm=1)). That prevents the weights from becoming too large, so other weights are far less likely to become too small.
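To make the check concrete, a small sketch (the loss/metrics here are placeholders):

import numpy as np
from keras.optimizers import Adam

# Print each layer's weight norms to spot layers that have collapsed toward
# zero next to layers with very large norms.
for layer in model.layers:
    for w in layer.get_weights():
        print(layer.name, w.shape, float(np.linalg.norm(w)))

# Clip the global gradient norm so no single update can blow some weights up
# and starve the surrounding layers.
model.compile(loss="mse", optimizer=Adam(global_clipnorm=1.0), metrics=["mse"])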