keras-team / keras

Deep Learning for humans
http://keras.io/

model.evaluate() gives a different loss on training data from the one in training process #6977

Closed alanwang93 closed 3 years ago

alanwang93 commented 7 years ago

I'm implementing a CNN model. When it has just a few layers it works well, but when I try a deeper network I achieve high performance (a small loss reported during training) on the training data, yet when I call model.evaluate() on the same training data I get poor performance (a much greater loss). I don't understand why this happens, since the evaluation is done on the training data as well.

Here is what I got:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import (Conv2D, Activation, BatchNormalization, MaxPooling2D,
                          Dropout, GlobalAveragePooling2D, Dense)
from keras.optimizers import Adam

input_shape = (X.shape[1], X.shape[2], 1)
model = Sequential()

y = [label2id[l] for l in labels.reshape(-1)]
y = keras.utils.to_categorical(y)

model.add(Conv2D(32, (5, 5), strides=(2,2), input_shape=input_shape))
model.add(Activation('relu'))
model.add(BatchNormalization())

model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Conv2D(512, (1, 1)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Conv2D(15, (1, 1)))
model.add(Activation('relu'))
model.add(BatchNormalization())

model.add(GlobalAveragePooling2D())

model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(15, activation='softmax'))

opt = Adam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

model.fit(np.expand_dims(X, axis=3), y, batch_size=200, epochs=15, validation_data=(np.expand_dims(X_val,3), y_val))

The log during training:

Train on 582 samples, validate on 290 samples
Epoch 1/15
582/582 [==============================] - 14s - loss: 2.6431 - acc: 0.1821 - val_loss: 2.6653 - val_acc: 0.0759
Epoch 2/15
582/582 [==============================] - 12s - loss: 2.3759 - acc: 0.3832 - val_loss: 3.9411 - val_acc: 0.0655
Epoch 3/15
582/582 [==============================] - 13s - loss: 2.0834 - acc: 0.4141 - val_loss: 7.2338 - val_acc: 0.0655
Epoch 4/15
582/582 [==============================] - 13s - loss: 1.8380 - acc: 0.5120 - val_loss: 9.4135 - val_acc: 0.0655
Epoch 5/15
582/582 [==============================] - 13s - loss: 1.6002 - acc: 0.5550 - val_loss: 10.0389 - val_acc: 0.0655
Epoch 6/15
582/582 [==============================] - 13s - loss: 1.3725 - acc: 0.6117 - val_loss: 11.0042 - val_acc: 0.0759
Epoch 7/15
582/582 [==============================] - 13s - loss: 1.1924 - acc: 0.6443 - val_loss: 10.2766 - val_acc: 0.0862
Epoch 8/15
582/582 [==============================] - 13s - loss: 1.0529 - acc: 0.6993 - val_loss: 9.2593 - val_acc: 0.0862
Epoch 9/15
582/582 [==============================] - 13s - loss: 0.9137 - acc: 0.7491 - val_loss: 9.9668 - val_acc: 0.0897
Epoch 10/15
582/582 [==============================] - 13s - loss: 0.7928 - acc: 0.7784 - val_loss: 9.4821 - val_acc: 0.0966
Epoch 11/15
582/582 [==============================] - 13s - loss: 0.6885 - acc: 0.8179 - val_loss: 8.7342 - val_acc: 0.1000
Epoch 12/15
582/582 [==============================] - 12s - loss: 0.6094 - acc: 0.8213 - val_loss: 8.5325 - val_acc: 0.1207
Epoch 13/15
582/582 [==============================] - 12s - loss: 0.5345 - acc: 0.8488 - val_loss: 7.9924 - val_acc: 0.1207
Epoch 14/15
582/582 [==============================] - 12s - loss: 0.4800 - acc: 0.8643 - val_loss: 7.8522 - val_acc: 0.1000
Epoch 15/15
582/582 [==============================] - 12s - loss: 0.4357 - acc: 0.8660 - val_loss: 7.1004 - val_acc: 0.1172

When I evaluate on training data:

score = model.evaluate(np.expand_dims(X, axis=3), y, batch_size=32)
print score
576/582 [============================>.] - ETA: 0s[7.6189327469396426, 0.10309278350515463]

On validation data

score = model.evaluate(np.expand_dims(X_val, axis=3), y_val, batch_size=32)
print score
288/290 [============================>.] - ETA: 0s[7.1004119609964302, 0.11724137931034483]

Could someone help me? Thanks a lot.

ouzan19 commented 7 years ago

Same problem happens for me...

danielS91 commented 7 years ago

It's due to the dropout layers. During the training phase neurons are dropped; during prediction all neurons remain active in the network. So it's quite likely that the results will differ. You can see it directly from the results on the validation data: they are equal, because both results are generated in the same way.

Edit: The batch normalization layers also influence the results.
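
To illustrate the point (a minimal sketch, assuming the standalone Keras 2.x API with a TensorFlow 1.x backend, not code from this thread): the same weights produce different outputs depending on the learning phase, which is why training-phase numbers and test-phase numbers can disagree even on identical data.

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(8, activation='relu', input_shape=(4,)),
    Dropout(0.5),
    Dense(1),
])
model.compile(loss='mse', optimizer='adam')

x = np.random.rand(3, 4).astype('float32')

# Build a backend function whose learning phase we control explicitly.
predict_fn = K.function([model.input, K.learning_phase()], [model.output])
out_train_phase = predict_fn([x, 1])[0]  # phase 1: dropout active, output is stochastic
out_test_phase = predict_fn([x, 0])[0]   # phase 0: dropout disabled, output is deterministic

print(np.allclose(out_train_phase, out_test_phase))  # almost always False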

danielS91 commented 7 years ago

Regarding the problem that the two losses are quite different: it looks like your model structure does not fit the problem well.

ouzan19 commented 7 years ago

Even without dropout layers and batch normalization, the same issue persists for me. I don't agree that the problem is caused by the model structure, because the training and test data are the same.

danielS91 commented 7 years ago

How large is the difference in your case? The two loss values will not match exactly, because during training the network parameters change from batch to batch and Keras reports the mean loss over all batches...

ouzan19 commented 7 years ago

I use only one batch. In training, the final loss (mse) is 0.045; evaluating on the same training data gives 1.14.

danielS91 commented 7 years ago

That's strange. Did you try a different dataset? Can you provide some code to reproduce the problem? (A small public dataset would be great.)

bzhong2 commented 7 years ago

#6895 I have a similar problem, and even tried with a public dataset. I was doing fine-tuning.

fraztto commented 7 years ago

I had an issue like this one, and the solution for me was very simple. I was evaluating on the training data and the accuracy was quite different from the one reported while training. It turned out that when evaluating I had swapped the dimensions of the input images: height was width and width was height (silly me).

ouzan19 commented 6 years ago

Hi guys,

Other than dropout, batch norm also causes the same problem. I suspect this is caused by the fact that the number of samples used by batch norm after the activation is 200 (the batch size) at training time, whereas it is only 1 at test time. This causes different normalization and a different loss.
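
A rough numpy illustration of that difference (a conceptual sketch only, not the Keras implementation; gamma, beta, and epsilon are ignored):

import numpy as np

x_batch = np.random.randn(200, 64)   # activations for a training batch of 200
moving_mean, moving_var = 0.0, 1.0   # running statistics kept by the layer

# Training phase: normalize with the statistics of the current batch.
train_out = (x_batch - x_batch.mean(axis=0)) / np.sqrt(x_batch.var(axis=0))

# Test phase: normalize with the stored moving statistics instead.
test_out = (x_batch - moving_mean) / np.sqrt(moving_var)

# If the moving statistics have not converged to the data statistics,
# train_out and test_out (and hence the losses) can differ substantially.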

What are your thoughts?

renato145 commented 6 years ago

#6895 Yes, I just encountered that problem with ResNet50.

BrianHuf commented 6 years ago

I'm running into the same problem. When I create learning curves from fit metrics, train and test look unrealistically different.

As an experiment, I tried calculating my own metrics.

from keras.callbacks import Callback

class SecondOpinion(Callback):
    def __init__(self, model, x_train, y_train, x_test, y_test):
        super(SecondOpinion, self).__init__()
        self.model = model
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test

    def on_epoch_end(self, epoch, logs={}):
        y_train_pred = self.model.predict(self.x_train)
        y_test_pred = self.model.predict(self.x_test)

        mse_train = ((y_train_pred - self.y_train) ** 2).mean()
        mse_test = ((y_test_pred - self.y_test) ** 2).mean()

        print("\n                                             Second Opinion loss: %5.4f - val_loss: %5.4f" % (mse_train, mse_test))

...

model.compile(
    loss='mean_squared_error',
    optimizer=adam
)

second_opinion = SecondOpinion(model, data.x_train, data.y_train, data.x_test, data.y_test)

model.fit(
    x=data.x_train,
    y=data.y_train,
    validation_data=(data.x_test, data.y_test),
    batch_size=200,
    epochs=200,
    callbacks=[second_opinion]
)

With batch normalization and dropout included, the train loss is very different (~3x). The validation losses differ, but not substantially.

Epoch 1/200
7200/7255 [============================>.] - ETA: 0s - loss: 208810.7629
                                             Second Opinion loss: 147483.0938 - val_loss: 164947.0781
7255/7255 [==============================] - 59s 8ms/step - loss: 207874.9320 - val_loss: 140131.2018

Epoch 2/200
7200/7255 [============================>.] - ETA: 0s - loss: 57029.7061
                                             Second Opinion loss: 128558.4609 - val_loss: 142726.4375
7255/7255 [==============================] - 55s 8ms/step - loss: 57108.7740 - val_loss: 135797.0371

Epoch 3/200
7200/7255 [============================>.] - ETA: 0s - loss: 49392.7298
                                             Second Opinion loss: 154096.3281 - val_loss: 173001.8438
7255/7255 [==============================] - 55s 8ms/step - loss: 49370.2950 - val_loss: 151737.2370

With batch normalization and dropout removed, the loss is somewhat different and val_loss matches:

Epoch 1/200
7200/7255 [============================>.] - ETA: 0s - loss: 1691567.5816
                                             Second Opinion loss: 592996.7500 - val_loss: 631589.8125
7255/7255 [==============================] - 35s 5ms/step - loss: 1682561.1545 - val_loss: 631589.8356

Epoch 2/200
7200/7255 [============================>.] - ETA: 0s - loss: 557553.0530
                                             Second Opinion loss: 503776.0000 - val_loss: 539686.3750
7255/7255 [==============================] - 32s 4ms/step - loss: 557585.9540 - val_loss: 539686.4883

Epoch 3/200
7200/7255 [============================>.] - ETA: 0s - loss: 434417.9800
                                             Second Opinion loss: 353186.8750 - val_loss: 383728.2500
7255/7255 [==============================] - 32s 4ms/step - loss: 434553.5198 - val_loss: 383728.2623

I'm not schooled enough to know whether these differences are intentional in Keras or not. Anyone?

mikowals commented 6 years ago

I am new to Keras, so maybe this is expected behaviour, but I can't find it documented in .fit() or .evaluate() that .fit() must be run first.

model.evaluate() consistently gives a wrong result if run after loading saved weights. Running model.fit(), even training for 1 step with a learning rate of 0, fixes the evaluation results, although the weights should not have changed.

Loading the weights from the file again after model.fit() causes the problem with model.evaluate() to reoccur.

Result of initial evaluate:

10000/10000 [==============================] - 27s 3ms/step Loss: 14.154, Accuracy: 0.114

Now train one step:

Train on 256 samples, validate on 256 samples Epoch 1/1 256/256 [==============================] - 46s 178ms/step - loss: 0.7133 - acc: 0.7930 - val_loss: 0.7081 - val_acc: 0.7930

Now run same evaluate call again:

10000/10000 [==============================] - 24s 2ms/step Loss: 0.659, Accuracy: 0.798

The code to produce this is:

from __future__ import print_function
import keras

batch_size = 256
img_rows, img_cols, img_channels = 32, 32, 3

(_, _), (x_test, y_test) = keras.datasets.cifar10.load_data()

x_test = x_test.astype('float32') / 255.
y_test = keras.utils.to_categorical(y_test, num_classes=10)

model = keras.applications.nasnet.NASNetMobile(
    input_shape=(img_rows, img_cols, img_channels),
    weights=None,
    classes=10
)
model.load_weights('weights.81-0.35-0.873.txt', by_name=True)
optimizer = keras.optimizers.SGD(lr=0.000, momentum=0.0, clipnorm=5)
model.compile(loss=['categorical_crossentropy'],
          optimizer=optimizer, metrics=['accuracy'])

def eval():
    metrics = model.evaluate(
        x=x_test,
        y=y_test,
        batch_size=batch_size,
        verbose=1,
        sample_weight=None
    )
    print ('Loss: {:.3f}, Accuracy: {:.3f}'.format(metrics[0], metrics[1]))

eval()
model.fit(x=x_test[:batch_size,...], y=y_test[:batch_size,...], batch_size=batch_size, epochs=1,     validation_data=(x_test[:batch_size,...], y_test[:batch_size,...]))
eval()

I am using Keras (2.1.4) installed with pip on macOS 10.13.4. This version of Keras prints a ton of deprecation warnings (from TensorFlow, I think), which I have omitted from the output for clarity; if you see them, they are not a problem with the code.

weights.81-0.35-0.873.txt

BrianHuf commented 6 years ago

I'm still in over my head here, but here's how things appear to me. Can anyone confirm I'm on the right track?

This is all tied to learning_phase (see https://keras.io/backend/) and loss/metric estimation based on batches.

Dropout is only active when the learning_phase is set to train; otherwise it is ignored. It's unclear to me whether BatchNormalization is still active when the learning_phase is test.

Batching presumes each batch can represent the entire data set. If the data is heavily skewed or batches aren't well randomized, I can imagine this will magnify the differences between losses from fit vs. predict.

It seems to me that learning curves are more correct when losses and metrics are evaluated with the learning_phase set to test and computed across all batches. I can imagine this is not done during fit because it would be computationally expensive.

lorenzoriano commented 6 years ago

I'm seeing the same problem

Deepu14 commented 6 years ago

I have the same problem.

Epoch 28/30 5760/5760 [==============================] - 4s 641us/step - loss: 0.0166 - acc: 0.9934 - val_loss: 0.0299 - val_acc: 0.9891
Epoch 29/30 5760/5760 [==============================] - 4s 644us/step - loss: 0.0163 - acc: 0.9932 - val_loss: 0.0296 - val_acc: 0.9875
Epoch 30/30 5760/5760 [==============================] - 4s 641us/step - loss: 0.0165 - acc: 0.9925 - val_loss: 0.0318 - val_acc: 0.9875

Evaluating on test data:

1712/1712 [==============================] - 0s 236us/step
$loss [1] 0.329597
$acc [1] 0.9281542

There is a huge difference between train-validation loss and test loss.

emerygoossens commented 6 years ago

I am having the same issue. I train a model, save the weights, and load the model; the resulting evaluate() call gives results that change each time.

azmathmoosa commented 6 years ago

I, too, have the same issue. I was training a DenseNet121 with all layers frozen except the last one or two.

Epoch 00032: val_acc did not improve from 0.29563
Epoch 33/90
154/154 [==============================] - 148s 963ms/step - loss: 0.1546 - acc: 0.9538 - val_loss: 6.4297 - val_acc: 0.2246

Epoch 00033: val_acc did not improve from 0.29563
Epoch 34/90
154/154 [==============================] - 148s 963ms/step - loss: 0.1416 - acc: 0.9573 - val_loss: 6.1487 - val_acc: 0.2423

Epoch 00034: val_acc did not improve from 0.29563
Epoch 35/90
154/154 [==============================] - 147s 955ms/step - loss: 0.1415 - acc: 0.9556 - val_loss: 6.6624 - val_acc: 0.2016

Epoch 00035: val_acc did not improve from 0.29563
Epoch 36/90
154/154 [==============================] - 147s 957ms/step - loss: 0.1457 - acc: 0.9545 - val_loss: 5.9998 - val_acc: 0.2548

Epoch 00036: val_acc did not improve from 0.29563
Epoch 00036: early stopping
154/154 [==============================] - 191s 1s/step
Final Training loss: 6.1547
Training accuracy:  0.2037

I ran evaluate() on the training data itself, and the validation data between epochs is also the training data! Yet the difference is huge.

I'm planning to drop Keras and move to TF.

raghavab1992 commented 6 years ago

I am facing the same issue... trying to fine-tune inception_v3. I added two Dense layers and set all the other Inception layers to trainable=False. So without any dropout layers, I am getting completely different metrics for the training data during training and evaluation!!

Epoch 1/25 35/35 [==============================] - 24s 693ms/step - loss: 2.1526 - categorical_accuracy: 0.2010 - val_loss: 12.1775 - val_categorical_accuracy: 0.0993
Epoch 2/25 35/35 [==============================] - 19s 557ms/step - loss: 1.8757 - categorical_accuracy: 0.3301 - val_loss: 12.5643 - val_categorical_accuracy: 0.1066
Epoch 3/25 35/35 [==============================] - 19s 533ms/step - loss: 1.6845 - categorical_accuracy: 0.4497 - val_loss: 12.5669 - val_categorical_accuracy: 0.1176

print(model.metrics_names, model.evaluate_generator(train_gen), model.evaluate_generator(val_gen))
['loss', 'categorical_accuracy'] [12.482194125054637, 0.0966271650022339] [12.378837978138643, 0.10294117647058823]

As none of the Inception layers are being trained, the batch norm layers should use the default mean and std dev and hence shouldn't give different results in the training and evaluation phases! Any idea why this is happening?

ub216 commented 6 years ago

Has anyone solved this? I'm having the same issue: model.evaluate gives completely different results compared to model.fit (with the learning rate set to zero). I don't use a dropout layer. I tried playing with the batch norm layers' "trainable" parameter but got similar performance.

shunjiangxu commented 6 years ago

I am having the same problem as well. In my case, I am trying to reuse the pre-trained Keras ResNet50 model and add my own last few layers. I get very large differences between .fit and .evaluate on the same training data. When I look at the prediction results on the training data, it's clear that .evaluate gives the right loss/accuracy. Anyone have any ideas? I don't believe the batchnorm/dropout layers are the reason here. Below are my differences.

From .fit:
Epoch 1/1 657/657 [==============================] - 327s 498ms/step - loss: 0.1465 - acc: 0.9691

From .evaluate with the same training data:
657/657 [==============================] - 356s 542ms/step [2.496475939699867, 0.4247050990252734]

j0bby commented 6 years ago

Hello everyone,

Here is the official Keras answer to this question: https://keras.io/getting-started/faq/#why-is-the-training-loss-much-higher-than-the-testing-loss

Even without dropout or batch normalization, the problem will persist. The reason is that when you use fit, the weights are updated after each batch of the training data. The loss value returned by fit is not the loss of the final model, but the mean of the losses of all the slightly different models used on each batch. The final model does not even contribute to that number, since the loss computed on the last batch is what is used to make the final weight update.

To sum everything up, fit and evaluate have two completely different behaviors, and comparing their outputs doesn't make much sense!
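
One way to make the two numbers comparable is to recompute the training loss with the fixed end-of-epoch weights, the same way evaluate does. A minimal sketch (a hypothetical callback for standalone Keras 2.x, not something proposed in this thread):

from keras.callbacks import Callback

class EpochEndTrainLoss(Callback):
    # Re-evaluate the training data with the weights as they are at the end of the epoch.
    def __init__(self, x_train, y_train):
        super(EpochEndTrainLoss, self).__init__()
        self.x_train = x_train
        self.y_train = y_train

    def on_epoch_end(self, epoch, logs=None):
        # self.model is attached automatically when the callback is passed to fit()
        results = self.model.evaluate(self.x_train, self.y_train, verbose=0)
        loss = results[0] if isinstance(results, (list, tuple)) else results
        print('epoch %d: train loss on fixed end-of-epoch weights: %.4f' % (epoch + 1, loss))

Passed via callbacks=[EpochEndTrainLoss(x_train, y_train)], this prints a number that is directly comparable to model.evaluate(x_train, y_train).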

ub216 commented 6 years ago

Hey @j0bby, thanks for your reply. The link you referred to describes expected behavior when the loss is averaged over the whole epoch. However, I see this discrepancy even when testing on a single batch of data! Moreover, I also tried setting "loss_weights" to zero so as to have zero gradients, and model.fit() still gives different (better) performance than model.evaluate(). Furthermore, if you look at shunjiangxu's post, model.fit() is doing better than model.evaluate(), not worse as explained in your link.

j0bby commented 6 years ago

Hello @ub216, may I ask what your model is? If you have some sort of regularizer, your gradient is not 0. My model does include dropout and no regularizer. It has 3 outputs on which I compute the loss as well. When I set loss_weights to 0.0, after one epoch on one batch the overall loss returned is 0.0 (as expected), while the loss computed for each output is greater than 0.0. However, the validation and training losses are different, as expected because of the dropout. Finally, some training losses are greater than the validation losses and some are lower.

Here is how to have the same output from fit and evaluate: pass the training data as validation_data to fit and compare the validation metrics it reports with the output of evaluate on the same data.

About @shunjiangxu's results: the two methods return different results, as expected. However, in that case the evaluate method, which is expected to give the better results, is performing worse. This can have different explanations depending on the hyperparameters of the model and the training.

mikowals commented 6 years ago

Hi @j0bby,

Thanks for trying to get to the bottom of this. If you run your code with the order of the commands reversed, do you still get matching results? Like this:

In my example above from the 19th of Feb, running fit() first works as you say, but why does fit() need to be run first? Is this behaviour documented somewhere?

ub216 commented 6 years ago

Hey @j0bby, thanks again for your prompt reply. I don't have dropout or regularizers. My model has three outputs and the total loss is a weighted sum of the three losses. For debugging purposes I have set their weights to zero. When I run:

model.fit(x_train,[y_train1,y_train2,y_train3],validation_data=(x_train,[y_train1,y_train2,y_train3])

I get:

Epoch 1/1 12/12 [==============================] - 1s 77ms/step - loss: 0.0000e+00 - out1_loss: 2.8200 - out2_loss: 0.3365 - out3_loss: 1.8442 - out1_categorical_cross_entropy4d_split: 0.0660 - out2_mean_squared_error: 0.1878 - out3_categorical_accuracy: 0.2500 - val_loss: 0.0000e+00 - val_out1_loss: 2.3867 - val_out2_loss: 0.3041 - val_out3_loss: 0.7214 - val_out1_categorical_cross_entropy4d_split: 0.0337 - val_out2_mean_squared_error: 0.1578 - val_out3_categorical_accuracy: 0.0000e+00

I checked the difference in the weights before and after executing the command, but the weights haven't changed! Any ideas/pointers on why this discrepancy occurs?

@mikowals I had the same issue with this model as well. I'm trying to figure this out first but maybe they are related.

shunjiangxu commented 6 years ago

Thanks for following this up. I am trying to dig out the reason for this. In my case, evaluate gave a much, much worse [loss, accuracy] result than .fit. I am trying to use a VGG16 model instead of ResNet50. @ub216 if all the weights are initialized to 0, won't they always stay at 0 during training due to the symmetry-breaking issue?

ub216 commented 6 years ago

@shunjiangxu By "weights" I mean the weights for the weighted loss not the NN parameters, sorry for the confusion.

shunjiangxu commented 6 years ago

@ub216 All right, sorry, I did not understand correctly. I was searching for a Keras callback function/parameter to print out the .fit output but can't seem to find one. The only way seems to be to run .evaluate in on_batch_end/on_epoch_end, but that is not really what .fit has calculated. Does anyone know if a callback can get the .fit 'prediction' output?

j0bby commented 6 years ago

Hi @mikowals, usually you want to call fit() before eval() because there is no interest in evaluating a randomly initialized model. The order does have an influence on the results, since calling fit() will update the weights of your network. However, if you make sure there is no gradient during training, then calling eval() before and after returns the same values (tested with a simple Conv1D layer).

@ub216, I would recommend building your model from scratch, adding one layer at a time and testing whether fit() returns the same loss and val_loss. You will then spot the layer that behaves differently in the learning phase. As stated in the link I provided, your model has a training mode (used to compute the loss on the training data) and a testing mode (used on the validation data). Some layers, like BatchNormalization or Dropout, are not active anymore in testing mode. If you use a simple network [Conv1D, Flatten, Dense] (and still with the loss_weights set to 0), the loss and val_loss are equal. Add a Dropout layer and the values differ.
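
A minimal sketch of that layer-by-layer check (my own toy example for standalone Keras 2.x; a zero learning rate stands in for the zero loss_weights so that no weight update happens):

import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense, Dropout
from keras.optimizers import SGD

x = np.random.rand(64, 10, 1).astype('float32')
y = np.random.rand(64, 1).astype('float32')

model = Sequential([
    Conv1D(4, 3, input_shape=(10, 1)),
    Flatten(),
    # Dropout(0.5),   # uncomment this line and loss / val_loss stop matching
    Dense(1),
])
model.compile(loss='mse', optimizer=SGD(lr=0.0))  # frozen weights: no updates at all

# With identical train and validation data and frozen weights, the reported loss
# and val_loss match as long as no layer behaves differently in the training phase.
model.fit(x, y, epochs=1, batch_size=64, validation_data=(x, y))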

ub216 commented 6 years ago

@j0bby Thanks again for those inputs. I was trying that yesterday and realized that as soon as I remove all the BatchNormalization layers, the loss and val_loss show the same behavior. The abnormal behavior returns even when I set their training flag to False. Trying to figure out why that is the case now.

andrey999333 commented 6 years ago

I have the same problem. My model has a DenseNet121 convolutional base with a couple of additional layers on top. All layers except the last two are set to layer.trainable = False. My loss is "mse", since it's a regression. During training I get a loss of ~3, while evaluation on the very same batch gives a loss of ~30.

regressor_modified.fit(x=dat[0],y=dat[1],batch_size=32)

Epoch 1/1 32/32 [==============================] - 0s 11ms/step - loss: 2.5571 <keras.callbacks.History at 0x7f5e44307908>

regressor_modified.evaluate(x=dat[0],y=dat[1])

32/32 [==============================] - 2s 59ms/step 29.276123046875

andrey999333 commented 6 years ago

It looks like I found the reason why this is happening. Check it out here

ub216 commented 6 years ago

@andrey999333 realized this on Friday just before leaving work. Was going to post it later today... You beat me to it! :)

shunjiangxu commented 6 years ago

Thanks everyone for spending the time researching this issue, and andrey999333 for the answer to this mystery. It looks like this Keras batchnorm issue has caused a lot of headaches for people doing transfer learning. Below is a blog post I found on the internet specifically about this: http://blog.datumbox.com/the-batch-normalization-layer-of-keras-is-broken/ Check it out; it has some solution proposals as well.

nicolefinnie commented 5 years ago

Here is how to have the same output from fit and evaluate:

  • model.fit(x_train, y_train, validation_data=(x_train, y_train))
  • model.evaluate(x_train, y_train)

Then the metrics on the validation set from the fit method are equal to the ones from the evaluate method. The loss on the training and validation data are different (testing is better), even if the dataset is the same, as explained in the link.

@j0bby I thought they should be the same, but the results show otherwise. In my case, model.evaluate() yields a much higher top-1 and top-k accuracy (defined in metrics) on the validation/test data than model.fit(..., validation_data=(x_val, y_val)).

Sample code: model.fit(x_train, y_train, validation_data=(x_val, y_val)); model.evaluate(x_val, y_val)

However, evaluate() yields a much higher loss than the validation pass during fit(), but also a higher accuracy, and I don't know which loss/acc we should trust: the one from fit's validation or the one from evaluate(). I used the default Keras mobilenet_v2 model, which contains batch norm layers, and I wondered whether that was the culprit.

UPDATE on Jan 10, 2019: I checked the code in model.fit() and model.evaluate() and ran some tests, and I can confirm that both of them call test_loop() in training_arrays.py, which yields the same result. I also checked the learning phase of the batch norm layer; it was 0 during training and testing, so that wasn't the problem. My bug was that I passed the original images to model.fit(validation_data=...) and the normalized images to model.evaluate(), and the latter outperformed the former. Sorry for the misleading information.

bnaman50 commented 5 years ago

Hello everyone,

Here is the official Keras answer to this question: https://keras.io/getting-started/faq/#why-is-the-training-loss-much-higher-than-the-testing-loss

Even without dropout or batch normalization, the problem will persist. The reason is that when you use fit, the weights are updated after each batch of the training data. The loss value returned by fit is not the loss of the final model, but the mean of the losses of all the slightly different models used on each batch. The final model does not even contribute to that number, since the loss computed on the last batch is what is used to make the final weight update.

To sum everything up, fit and evaluate have two completely different behaviors, and comparing their outputs doesn't make much sense!

Hey @j0bby, thanks for your explanation, but I believe this problem persists even if I take care of the things you mentioned.

Explanation of the task: It is supposed to be a very simple task. All I am doing is overfitting my own dataset of 256 images (29x29x3) with 256 output classes (one for each image).

Dataset used

Case 1: x_train = all the pixel values in the image are set to i, where i goes from 0 to 255. y_train = i

Case 2: x_train = the centre 5*5 patch of pixel values in the image is set to i, where i goes from 0 to 255; all the other pixel values are the same for all images. y_train = i

This gives me 256 images in total for the training data in each case. (It would be clearer if you just have a look at the code.)

Here is my code to reproduce the issue -

from __future__ import print_function

import os

import keras
from keras.datasets import mnist
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Activation
from keras.layers.normalization import BatchNormalization
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, Callback
from keras import backend as K
from keras.regularizers import l2

import matplotlib.pyplot as plt
import PIL.Image
import numpy as np
from IPython.display import clear_output

# The GPU id to use, usually either "0" or "1"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="1"

# To suppress the warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

## Hyperparamters
batch_size = 256
num_classes = 256
l2_reg=0.0
epochs = 500

## input image dimensions
img_rows, img_cols = 29, 29

## Train Image (I took a random image from ImageNet)
train_img_name = 'n01871265_279.JPEG'
ret = PIL.Image.open(train_img_name) #Opening the image
ret = ret.resize((img_rows, img_cols)) #Resizing the image
img = np.asarray(ret, dtype=np.uint8).astype(np.float32) #Converting it to numpy array
print(img.shape) # (29, 29, 3)

## Creating the training data
#############################
x_train = np.zeros((256, img_rows, img_cols, 3))
y_train = np.zeros((256,), dtype=int)
for i in range(len(y_train)):
    temp_img = np.copy(img)
    ## Case1 of dataset
    # temp_img[:, :, :] = i # changing all the pixel values
    ## Case2 of dataset
    temp_img[12:16, 12:16, :] = i # changing the centre block of 5*5 pixels
    x_train[i, :, :, :] = temp_img
    y_train[i] = i
##############################

## Common stuff in Keras
if K.image_data_format() == 'channels_first':
    print('Channels First')
    x_train = x_train.reshape(x_train.shape[0], 3, img_rows, img_cols)
    input_shape = (3, img_rows, img_cols)
else:
    print('Channels Last')
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 3)
    input_shape = (img_rows, img_cols, 3)

## Normalizing the pixel values
x_train = x_train.astype('float32')
x_train /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')

## convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)

## Model definition
def model_toy(mom):
    model = Sequential()

    model.add( Conv2D(filters=64, kernel_size=(7, 7), strides=(1,1), input_shape=input_shape, kernel_regularizer=l2(l2_reg)) )
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))
    #Default parameters kept same as PyTorch
    #Meaning of PyTorch momentum is different from Keras momentum.
    # PyTorch mom = 0.1 is same as Keras mom = 0.9

    model.add( Conv2D(filters=128, kernel_size=(7, 7), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=256, kernel_size=(5, 5), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=512, kernel_size=(5, 5), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=1024, kernel_size=(5, 5), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add( Conv2D( filters=2048, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=l2(l2_reg) ) )
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=4096, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    # Passing it to a dense layer
    model.add(Flatten())

    model.add(Dense(1024, kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    # Output Layer
    model.add(Dense(num_classes, kernel_regularizer=l2(l2_reg)))
    model.add(Activation('softmax'))

    return model

mom = 0.9 #0
model = model_toy(mom)
model.summary()

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(lr=0.001),
              #optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.9, decay=0.0, nesterov=True),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    shuffle=True,
                   )

print('Training results')
print('-------------------------------------------')
score = model.evaluate(x_train, y_train, verbose=1)
print('Training loss:', score[0])
print('Training accuracy:', score[1])
print('-------------------------------------------')

Small note - I was able to do this task successfully in PyTorch. It is just that my actual task requires a Keras model. That's why I changed the default values of the BatchNorm layer (the root cause of the issue) to match the ones I used to train the PyTorch model.

Here is the image that I used in my code.

Here are the results of training. Case1 of the dataset Case2 of the dataset

If you look at these two files, you will notice the discrepancy between the training loss during training and during inference. (I have specifically set my batch size equal to the size of my training data.)

Next, I looked at the Keras source code to see if there is any way I can make the BatchNorm layer use the batch statistics instead of the running mean and variance. Here is the update formula that Keras (TensorFlow backend) uses for the running mean and variance: running_stat -= (1 - momentum) * (running_stat - batch_stat). So if I set the momentum value to 0, the value assigned to running_stat during the training phase is always equal to batch_stat. Thus, the value used during inference should also be the same as (or close to) the batch/dataset statistics. Here are the results for this little experiment, with the same issue still occurring: Case 1 of the dataset, Case 2 of the dataset.
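
To make the momentum argument above concrete, here is the quoted update rule as plain Python (a toy sketch with made-up numbers, not Keras code):

def update_running_stat(running, batch, momentum):
    # Keras/TF form: running -= (1 - momentum) * (running - batch)
    return running - (1.0 - momentum) * (running - batch)

print(update_running_stat(5.0, 2.0, momentum=0.9))  # 4.7 -> slow drift toward the batch statistic
print(update_running_stat(5.0, 2.0, momentum=0.0))  # 2.0 -> running stat equals the last batch statistic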

Programming environment - Python 3.5.2, tensorflow-1.10.0, keras-2.2.4. I tried the same thing with tensorflow-1.12.0 and keras-2.2.2 as well, but it did not solve the issue.

tdoneal commented 5 years ago

It seems like there are issues with both Dropout and BatchNormalization relating to an "incorrect" evaluate() result.

When a Dropout layer is set to trainable=False, the results should be consistent across fit() and evaluate(), but they aren't, at least in my scenario. Perhaps (and this is just a guess) the fraction of neurons dropped remains at 0.5 (or whatever was configured) even when trainable=False is set. If that's the case, it would probably be better to disable the dropping entirely, to be consistent with the semantics of "not trainable".

As mentioned elsewhere, BatchNormalization just plain doesn't work as expected with evaluate(), perhaps for an analogous reason.

It seems there are two competing concepts of "frozen": Layer.trainable and "training mode". If those conflict, confusion results.

See @fchollet response at https://github.com/keras-team/keras/pull/9965
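
A tiny sketch of those two switches being independent (an illustrative example assuming tf.keras under TF 2.x, not the setup used in this thread):

import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
drop.trainable = False            # no effect here: Dropout has no weights to freeze

x = tf.ones((1, 10))
print(drop(x, training=True))     # units are still dropped: behavior follows "training"
print(drop(x, training=False))    # identity output: inference behavior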

vmirly commented 5 years ago

I think the issue is not just with Dropout and BN. I have an RNN model which shows the same issue. In this case, the computed loss is really off, because the accuracy is quite good; if the accuracy is good, the loss cannot be that high.

here is my model:

import tensorflow as tf

# build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              name='embed-layer'),

    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64), 
        name='bidirectional-lstm'),

    tf.keras.layers.Dense(64, activation='relu'),

    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])

train_data = imdb_train_encoded.padded_batch(
    BATCH_SIZE, padded_shapes=([-1],[]))

valid_data = imdb_valid_encoded.padded_batch(
    BATCH_SIZE, padded_shapes=([-1],[]))

model.fit(train_data, validation_data=valid_data, 
          epochs=10)

So I computed the loss by manually iterating over the validation dataset and got 0.11, but when I use model.evaluate() I get 0.489 for the loss:

loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=False, reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)

n = 0
tot_loss = 0.0
for batch in valid_data.take(2):
    output = model(batch[0])
    loss = loss_fn(y_true=batch[1], y_pred=output)
    n += 1                   # loss_fn already averages over each batch, so count batches
    tot_loss += loss.numpy()

print(tot_loss / n)          # mean of the per-batch losses, comparable to evaluate()


franchesoni commented 5 years ago

Similar issue. When using .fit_generator one can pass the validation set as an argument. Defining a custom callback and printing the output of .evaluate over the training and test sets led me to the following conclusion:

This happens with and without Dropout, which is consistent with @j0bby's answer: a rolling mean over a model that varies with time.

Here is the official Kera's answer to this question. https://keras.io/getting-started/faq/#why-is-the-training-loss-much-higher-than-the-testing-loss

Osdel commented 5 years ago

I had that issue and I solved it by initializing the session, then initializing the variables, and then loading the model weights. For example:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(init_op)  # initializer for the data input pipeline
    model.load_weights(model_path, by_name=True)

DuyHuynhLe commented 5 years ago

I ran into a similar issue recently and found a solution to my problem. I hope it works in your cases.

The problem: model A was trained and validated on a (train, val) pair using fit_generator. That same model was evaluated on (val) using evaluate_generator and gave the same results as the validation metrics of the last epoch. The weights of A were saved after each epoch using keras.callbacks.ModelCheckpoint.

Model B was constructed using the same script as A. The last saved weights of A were loaded into B using load_weights. The results of B on (val) using evaluate_generator were different from what we obtained with A.

I re-ran the test in one notebook and checked the weights of A and B. They seemed to be identical. However, upon closer inspection, I found that the total number of weight matrices was wrong.

The cause: the model has a custom layer, implemented by subclassing keras.layers.Layer. I had done that before, but usually my custom layers performed some stateless transformation. This time, the custom layer is composed of some Conv layers. The weights of these hidden Conv layers cannot be accessed with get_weights() and are not saved.

Solution: I re-implemented that custom layer by inheriting from keras.Model. An example is given in https://www.tensorflow.org/tutorials/customization/custom_layers. B and A now give identical results.

offchan42 commented 5 years ago

If you are doing transfer learning, read this for solution.

The problem is happening because of the BatchNorm layer. I have the exact same problem as the author: the behavior of this layer is not the same during training as during prediction.

Note that it is not a problem of the estimated loss being accumulated over several steps, because the training loss is already broken even within one batch. Look at the following output (screenshot omitted): I used a variant of the MobileNetV2 architecture with pre-trained ImageNet weights. This model has only BatchNorm layers, no Dropout. You can see that I get a lower training loss while training than while testing, and that's just one batch of data, not one epoch. Ideally, the last two lines should report similar numbers, as they are evaluating with the same weights.

This BatchNorm behavior seems to work fine with classification problems, but it's broken with my regression problem.

See this issue: https://github.com/keras-team/keras/pull/9965
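
For reference, the workaround usually suggested for this transfer-learning case is to call the frozen base in inference mode (a sketch assuming the tf.keras functional API in TF 2.x; the input size and regression head here are made up for illustration):

import tensorflow as tf

base = tf.keras.applications.MobileNetV2(input_shape=(96, 96, 3),
                                         include_top=False, weights='imagenet')
base.trainable = False                       # freeze the pretrained weights

inputs = tf.keras.Input(shape=(96, 96, 3))
x = base(inputs, training=False)             # keep BatchNormalization in inference mode
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1)(x)        # regression head
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer='adam', loss='mse')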

U-C-J commented 4 years ago

Thanks @j0bby.

The website description: Besides, the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss.

Example: train on 50000 samples and validate using the training data, without dropout.

Train on 50000 samples, validate on 50000 samples
model.fit(x_train, y_train, epochs=my_epochs, validation_data=(x_train, y_train), batch_size=128, verbose=2)

Epoch 1/3 50000/50000 - 33s - loss: 1.5784 - accuracy: 0.4168 - val_loss: 3.4444 - val_accuracy: 0.1693
Epoch 2/3 50000/50000 - 32s - loss: 1.1845 - accuracy: 0.5782 - val_loss: 1.2492 - val_accuracy: 0.5523
Epoch 3/3 50000/50000 - 32s - loss: 1.0009 - accuracy: 0.6444 - val_loss: 1.0644 - val_accuracy: 0.6216

And then we use evaluate:
model.evaluate(x_train, y_train, verbose=2)
50000/50000 - 15s - loss: 1.0644 - accuracy: 0.6216

You can see that the validation result reported by fit on the training data matches the evaluate result exactly. The loss and accuracy displayed during each epoch are the average of the losses over each batch of training data.

My understanding: for simplicity, suppose you have only 6 data points and your batch size is 2, so you iterate 3 times to complete an epoch. Now consider just one epoch.

Epoch 1, fit stage:
Epoch 1, iteration 1 (data 1-2): you get model M11; the loss on data 1-2 is L11, the accuracy is A11.
Epoch 1, iteration 2 (data 3-4): you get model M12; the loss on data 3-4 is L12, the accuracy is A12.
Epoch 1, iteration 3 (data 5-6): you get model M13; the loss on data 5-6 is L13, the accuracy is A13.
Epoch 1, evaluate stage: use model M13 to evaluate the validation dataset; the loss is L1, the accuracy is A1.

The reported output for epoch 1 will then be loss: (L11+L12+L13)/3, acc: (A11+A12+A13)/3.

Thus the fit loss and accuracy should differ from the evaluation. And why is the fit accuracy usually higher? Because model M11 was fit on data 1-2 and its loss is calculated on data 1-2 only; if you calculated its loss on all of data 1-6, the result would be worse than that per-batch figure. The same goes for M12 and M13.
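
As a toy illustration of that bookkeeping (plain Python with made-up numbers, not Keras internals):

batch_losses = [2.1, 1.4, 0.9]       # L11, L12, L13 from the walkthrough above (models M11 -> M12 -> M13)
epoch_loss_printed_by_fit = sum(batch_losses) / len(batch_losses)
print(epoch_loss_printed_by_fit)     # 1.466..., an average over three slightly different models
# evaluate(), by contrast, scores every sample with the final model M13 only.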

LaurentBerger commented 4 years ago

I think there is a problem, but only in how the values are printed. Let's try this small sample:

import tensorflow as tf
import numpy as np
import cv2 as cv

BATCH_SIZE  = 32
num_epochs = 2

mnist = tf.keras.datasets.mnist
(img_train, label_train ), (img_test, label_test) = mnist.load_data()
# Normalize pixel values to [0, 1]
img_train = img_train / 255
dataset = tf.data.Dataset.from_tensor_slices(( img_train[0:BATCH_SIZE] , label_train[0:BATCH_SIZE] ))
dataset = dataset.repeat( num_epochs ).batch( BATCH_SIZE )

model = tf.keras.Sequential()
couche0 = tf.keras.layers.Flatten()
couche1 = tf.keras.layers.Dense(32, activation='relu')
couche2 = tf.keras.layers.Dense(10,activation='softmax')
model.add(couche0)
model.add(couche1)
model.add(couche2)
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs')  # defined here so the snippet runs
model.fit(dataset,steps_per_epoch=1,epochs=num_epochs,callbacks=[tensorboard_callback])
score, acc = model.evaluate(img_train[0:BATCH_SIZE], label_train[0:BATCH_SIZE],verbose=0)
print('Test score:', score)
print('Test score:', acc)
model.fit(dataset,steps_per_epoch=1,epochs=1,callbacks=[tensorboard_callback])

Results

W1208 11:46:20.395202   904 base_layer.py:1814] Layer sequential is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Train for 1 steps
Epoch 1/2
1/1 [==============================] - 0s 219ms/step - loss: 2.4898 - accuracy: 0.0625
Epoch 2/2
1/1 [==============================] - 0s 16ms/step - loss: 2.3725 - accuracy: 0.0938
Test score: 2.271101951599121
Test score: 0.0625
Train for 1 steps
1/1 [==============================] - 0s 16ms/step - loss: 2.2711 - accuracy: 0.0625

As you can see, evaluate gives the same loss that the next fit call reports for its first iteration.

SimonZhao777 commented 4 years ago

Has anyone solved the problem yet? I'm facing the same problem here... I'm using Keras 2.2.4 and TensorFlow 1.5.0. I tried to print the results every 10 training epochs, and the results of model.evaluate, model.predict and model.test_on_batch are all consistent with each other, but none of them match the training-phase results, even when I use the same training data for all of them.

here are the results:

epoch=281, loss=16.09882 max_margin_loss=15.543743 ortho_loss=0.5550766
epoch=282, loss=15.947226 max_margin_loss=15.379948 ortho_loss=0.5672779
epoch=283, loss=15.848539 max_margin_loss=15.284585 ortho_loss=0.56395435
epoch=284, loss=15.519976 max_margin_loss=14.971162 ortho_loss=0.5488138
epoch=285, loss=14.816533 max_margin_loss=14.289791 ortho_loss=0.526742
epoch=286, loss=14.412685 max_margin_loss=13.907438 ortho_loss=0.5052471
epoch=287, loss=14.295979 max_margin_loss=13.805334 ortho_loss=0.49064445
epoch=288, loss=14.7037945 max_margin_loss=14.220262 ortho_loss=0.4835329
epoch=289, loss=14.691599 max_margin_loss=14.213996 ortho_loss=0.47760296
epoch=290, loss=14.596203 max_margin_loss=14.125141 ortho_loss=0.4710617

model.evaluate========
train_loss=[20.45014190673828, 19.984760284423828]
val_loss=[20.450117111206055, 19.984760284423828]
test_loss=[20.450183868408203, 19.984760284423828]

model.predict========
prediction_result len=2708
[[19.999733] [19.963854] [20.013517] ... [20.03875 ] [20.024363] [20.024124]]

model.test_on_batch========
test_train_batch_result=[20.450142, 19.98476]
test_val_batch_result=[20.450117, 19.98476]
test_test_batch_result=[20.450184, 19.98476]

andrey999333 commented 4 years ago

it might be the problem i have described in the stackoverflow post: https://stackoverflow.com/questions/51123198/strange-behaviour-of-the-loss-function-in-keras-model-with-pretrained-convoluti

SimonZhao777 commented 4 years ago

Hey guys, I found an easy solution which works at least in my case (my model has a Dropout layer but no BatchNormalization layer), thanks to OverLordGoldDragon in the links here and here.

The easy fix for me is to set the Keras learning phase to 0 before building and initializing my model. Here is a demo:

import keras.backend as K

K.set_learning_phase(0)  # must be called before the model is built

# ... then the model building and compiling code ...

Now the four results (model.train_on_batch, model.evaluate, model.predict, model.test_on_batch) are all as expected.

Below is the experiment output:

epoch=882, loss=8.4112625 max_margin_loss=7.6551723 ortho_loss=0.75609016
epoch=883, loss=8.406249 max_margin_loss=7.6501327 ortho_loss=0.7561164
epoch=884, loss=8.400357 max_margin_loss=7.644247 ortho_loss=0.7561102
epoch=885, loss=8.395483 max_margin_loss=7.639352 ortho_loss=0.7561312
epoch=886, loss=8.398947 max_margin_loss=7.642764 ortho_loss=0.7561827
epoch=887, loss=8.394142 max_margin_loss=7.6379457 ortho_loss=0.7561965
epoch=888, loss=8.387917 max_margin_loss=7.63174 ortho_loss=0.7561765
epoch=889, loss=8.383256 max_margin_loss=7.6270676 ortho_loss=0.7561884
epoch=890, loss=8.386976 max_margin_loss=7.6307592 ortho_loss=0.756217

model.evaluate========
train_loss=[8.382329940795898, 7.62611198425293]
val_loss=[8.38233470916748, 7.62611198425293]
test_loss=[8.382333755493164, 7.62611198425293]

model.predict========
prediction_result len=2708
[[11.143183 ] [ 2.2248592] [ 4.534893 ] ... [ 7.269316 ] [ 9.213724 ] [ 5.3815193]]

model.test_on_batch========
test_train_batch_result=[8.38233, 7.626112]
test_val_batch_result=[8.382335, 7.626112]
test_test_batch_result=[8.382334, 7.626112]

liangsun-ponyai commented 4 years ago

@Osdel Why does your change fix the problem?

Echosanmao commented 4 years ago

The same problem happens for me... Hello, have you resolved this issue? Could you share what you found? Thank you!!