keras-team / keras

Deep Learning for humans
http://keras.io/

BatchNormalization does not work with autoregressive models #6623

Closed barvinograd closed 3 years ago

barvinograd commented 7 years ago

When batch norm is used inside a model that is invoked more than once (e.g. inside an autoregression step), it produces very different results at prediction (non-training) time. I have isolated a toy problem, and the suspected reason is that the update values are accumulated across all invocations and then applied with K.moving_average_update. In the example below, with momentum 0.5 on the per-invocation means 2, 4 and 8, we would expect either 5.25 = 0.5 * (0.5 * (0.5 * 0 + 0.5 * 2) + 0.5 * 4) + 0.5 * 8 (the moving average applied once per invocation) or 2.33 = 0.5 * 0 + 0.5 * (2 + 4 + 8) / 3 (a single update with the mean of the per-invocation means). Instead, with TensorFlow we get 7 = 0.5 * 0 + 0.5 * (2 + 4 + 8), i.e. a single update with the sum of the means; likewise, with momentum=0.1 the output is 12.666 ≈ 0.1 * 0 + 0.9 * (2 + 4 + 8).

With Theano, only the first invocation's update is applied.
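
As a sanity check on the arithmetic, here is a minimal plain-Python sketch of the three cases (expected per-invocation updates, the TensorFlow result, and the Theano result); it only reproduces the numbers above, it is not backend code:

# Moving-average arithmetic with momentum 0.5, initial moving mean 0,
# and per-invocation means 2, 4, 8 (the first training step below).
momentum = 0.5
means = [2.0, 4.0, 8.0]

# Expected: one moving-average update per invocation.
m = 0.0
for value in means:
    m = momentum * m + (1.0 - momentum) * value
print(m)                                               # 5.25

# Observed with TensorFlow: one update with the sum of the means.
print(momentum * 0.0 + (1.0 - momentum) * sum(means))  # 7.0

# Observed with Theano: one update with the first mean only.
print(momentum * 0.0 + (1.0 - momentum) * means[0])    # 1.0

The script that reproduces the issue: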

import numpy as np
from keras.models import Model
from keras.layers import Layer, Input, Dense
from keras import initializers
import keras.backend as K

class CollectAvg(Layer):
    """Identity layer that keeps a moving average of the mean of its inputs."""
    def build(self, input_shape):
        self.moving_mean = self.add_weight(shape=(input_shape[1], ), name='moving_mean',
                                           initializer=initializers.get('zeros'),
                                           trainable=False)
        self.built = True

    def call(self, inputs):
        # Register a moving-average update of the batch mean (momentum 0.5);
        # the layer itself passes its input through unchanged.
        mean = K.mean(inputs)
        self.add_update([K.moving_average_update(self.moving_mean, mean, 0.5)], inputs)
        return inputs

def get_shared_model():
    x = shared_input = Input(shape=(1, ))

    # Fixed kernel of 2, so each application of the shared model doubles its input.
    w = [np.array([[2.]])]
    x = Dense(1, use_bias=False, weights=w)(x)

    x = CollectAvg()(x)

    shared_model = Model(inputs=[shared_input], outputs=[x])
    return shared_model

def get_auto_regressive(submodel, n):
    # Apply the shared submodel n times, feeding each output back in as the next input.
    x = model_input = Input(shape=(1, ))
    outputs = []
    for i in range(n):
        x = submodel(x)
        outputs.append(x)

    model = Model(inputs=[model_input], outputs=outputs)
    model.compile(optimizer='sgd', loss='mse')
    return model

submodel = get_shared_model()
submodel.summary()
model = get_auto_regressive(submodel, 3)
model.summary()
sample_input = np.asarray([[1]])
for i in range(2):
    sample_output = model.predict(sample_input)
    model.train_on_batch(sample_input, sample_output)
    print ("collected avg:",  submodel.layers[-1].get_weights())
print ("outputs:", np.array(sample_output))

with the output being:

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 1)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 1         
_________________________________________________________________
collect_avg_1 (CollectAvg)   (None, 1)                 1         
=================================================================
Total params: 2
Trainable params: 1
Non-trainable params: 1
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 1)                 0         
_________________________________________________________________
model_1 (Model)              (None, 1)                 2         
=================================================================
Total params: 2
Trainable params: 1
Non-trainable params: 1
_________________________________________________________________

For Theano:

collected avg: [array([ 1.], dtype=float32)]
collected avg: [array([ 1.5], dtype=float32)]
outputs: [[[ 2.]]

 [[ 4.]]

 [[ 8.]]]

For TensorFlow:

collected avg: [array([ 7.], dtype=float32)]
collected avg: [array([ 3.5], dtype=float32)]
outputs: [[[ 2.]]

 [[ 4.]]

 [[ 8.]]]
barvinograd commented 7 years ago

Updated the description for the inconsistent behavior between the Theano and TF backends.

datumbox commented 7 years ago

I've been observing strange results with BN on Siamese models and am still trying to debug what is going on. @barvinograd, got any update on this?

barvinograd commented 7 years ago

A Siamese architecture will produce the same bug. I have narrowed the problem down to the gather function. Both in TensorFlow and in Theano, it will only execute one update, either the last one or the first, depending on the backend.

datumbox commented 7 years ago

@barvinograd This bug was driving me nuts, thanks for confirming. Since both backends are affected, I assume this is an issue in Keras, right?

If I can help, let me know, because I'm also looking for a fix.

barvinograd commented 7 years ago

I would actually consider it a bug in both frameworks rather than in Keras. I am also not sure how to create a workaround in Keras other than not using gather when the same update is applied multiple times. I would like to hear about other options if they exist.

datumbox commented 7 years ago

I'm not sure which gather you are referring to; I've checked moving_average_update() in the backend but I can't find such a call. Could you please point me to the right line?

My setup is simpler than yours. I have a CNN that serves as the building block of a Siamese network with two inputs. The inputs are passed through the same CNN, and their outputs are combined and further processed by a couple of layers. I know that BN is to blame, as the problem appears only when I include it in the CNN. This also happens only when I fine-tune the network, which is weird, since the moving mean/var are always updated (no matter what value the "trainable" flag has).

monaj07 commented 7 years ago

I have a Siamese network with BN layers:

import tensorflow as tf
from keras.models import Model
from keras.layers import (Input, Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Flatten, Lambda, Dense)

# input_size (e.g. [height, width]), input_data_1 and input_data_2 are defined elsewhere.
img_input = Input(shape=input_size + [3])
x = Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same')(img_input)
x = BatchNormalization(axis=-1)(x)
x = Activation('relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same')(x)
x = BatchNormalization(axis=-1)(x)
x = Activation('relu')(x)
x = MaxPooling2D(pool_size=2)(x)
branch = Model(inputs=img_input, outputs=x)

out_b1 = branch(input_data_1)
out_b2 = branch(input_data_2)
out_bf1 = Flatten()(out_b1)
out_bf2 = Flatten()(out_b2)
b_diff = Lambda(lambda vecs: tf.sqrt((vecs[0] - vecs[1]) ** 2 + 1e-9))([out_bf1, out_bf2])
dense1 = Dense(128, activation='relu')(b_diff)
output = Dense(2, activation='linear', name='Softmax_output')(dense1)

When I use train_on_batch in Keras, it all works fine, and the network trains much faster with the BN layers. However, when I switch to a TensorFlow session with learning_phase(): 1 during training and learning_phase(): 0 during validation, the architecture cannot be trained and the training accuracy stays at around 50%. Is this a known issue as well?
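
To be concrete about what I mean by the session-based version, it follows roughly this pattern (a minimal sketch; train_op, labels, x1, x2 and y are hypothetical names from the training loop, not defined in the snippet above):

import keras.backend as K

sess = K.get_session()
# Training step: learning_phase 1, so BN normalizes with batch statistics.
sess.run(train_op, feed_dict={input_data_1: x1, input_data_2: x2,
                              labels: y, K.learning_phase(): 1})
# Validation step: learning_phase 0, so BN uses its moving mean/variance.
sess.run(output, feed_dict={input_data_1: x1, input_data_2: x2,
                            K.learning_phase(): 0})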

datumbox commented 7 years ago

@monaj07 As far as I understand, you use the network for near-duplicate detection. I work on a similar problem and face the exact same issue. There seems to be a bug that affects the BatchNormalization layer when it is included in a Siamese network.

My way around the problem was to use an embedded base_model without BN layers (for example, a VGG-like structure). By doing so, the results you get are the same regardless of the learning_phase value.
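
For reference, this is the kind of branch I mean (a hypothetical sketch; the input size and filter counts are arbitrary, the only point is that it contains no BatchNormalization layers):

from keras.layers import Input, Conv2D, MaxPooling2D
from keras.models import Model

inp = Input(shape=(64, 64, 3))              # hypothetical input size
x = Conv2D(64, 3, padding='same', activation='relu')(inp)
x = MaxPooling2D(2)(x)
x = Conv2D(128, 3, padding='same', activation='relu')(x)
x = MaxPooling2D(2)(x)
base_model = Model(inputs=inp, outputs=x)   # shared by both siamese inputs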

@fchollet could you give us your input?

@barvinograd might be worth changing the title to "BatchNormalization does not work with autoregressive and siamese models" :)

barvinograd commented 7 years ago

@datumbox, my bad, I meant to write tf.group, not gather (and similarly in Theano). The calls are made by the Function class in the backends, which is invoked from _make_train_function in the Model class. The group for each framework will only apply one update per variable, so when the same batch norm layer is applied multiple times in the same graph, all but one of its updates are ignored. Whether the first or the last update is kept depends on the framework (I did not check CNTK).

Moreover, the TensorFlow backend's implementation of the moving average creates further bugs in the context of multiple applications in the same graph, at least as of when this issue was posted. Just using the simple x_1 * (1-a) + x_2 * a will produce the behavior I described above.
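
To illustrate the effect in plain Python (this is only a sketch of the behavior described above, not the actual backend code): if the collected updates end up keyed by the variable they target, applying the same layer several times leaves just one surviving update per variable.

# Sketch only: duplicate updates on the same variable collapse to one.
updates = [('moving_mean', 2.0), ('moving_mean', 4.0), ('moving_mean', 8.0)]
collapsed = dict(updates)    # one entry per target variable survives
print(collapsed)             # {'moving_mean': 8.0} -- here the last update wins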

barvinograd commented 7 years ago

@datumbox, @monaj07 I don't know what your use case (i.e. data) is for the Siamese network, but if you assign data to the branches at random and use a naive implementation like the one in the Theano backend, you might still be able to use batchnorm effectively - just keep in mind that the moving averages will be collected from only one branch. If a different domain (e.g. photos vs. drawings) runs through each branch, you will experience the same problem I did.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

barvinograd commented 5 years ago

This is still a problem

jonathantompson commented 5 years ago

@barvinograd, did you ever resolve this issue? I was just about to implement what you described.