Closed barvinograd closed 3 years ago
updated for inconsistent behavior with theano and tf backends
I've been observing strange results with BN on Siamese models. Still trying to debug what is going on with it. @barvinograd got any update on this?
A Siamese architecture will produce the same bug. I have narrowed down the problem to the gather function. Both in tensorflow and in theano they will only execute one update, either the last one or the first, depending on the backend.
@barvinograd This bug was driving me nuts, thanks for confirming. Since both backends are affected, I assume this is an issue in Keras, right?
If I can help, let me know, because I'm also looking for a fix.
I would actually consider it a bug in both frameworks and not in Keras. I am also not sure how to create a workaround in Keras other than not using gather when the same update is applied multiple times. I would like to hear about other options if they exist.
I'm not sure which gather you are referring to. I've checked moving_average_update() in the backend, but I can't find such a call. Could you please point me to the right line?
My setup is simpler than yours. I have a CNN which is a building block for a Siamese network with two inputs. The inputs are passed through the same CNN, and their outputs are combined and further processed with a couple of layers. I know that BN is to blame, as the problem appears only when I include BN layers in the CNN. Also, this happens only when I fine-tune the network, which is weird since the moving mean/var are always updated (no matter what value the "trainable" flag has).
I have a Siamese network with BN layers:
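import tensorflow as tf
from keras.layers import (Input, Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Flatten, Lambda, Dense)
from keras.models import Model
# input_size is a [height, width] list; input_data_1 and input_data_2 are the
# two input tensors of the Siamese pair (both defined elsewhere).
# Shared branch: two Conv-BN-ReLU-MaxPool stages.
img_input = Input(shape=input_size + [3])
x = Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same')(img_input)
x = BatchNormalization(axis=-1)(x)
x = Activation('relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same')(x)
x = BatchNormalization(axis=-1)(x)
x = Activation('relu')(x)
x = MaxPooling2D(pool_size=2)(x)
branch = Model(inputs=img_input, outputs=x)
# The same branch (and therefore the same BN layers) is applied to both inputs.
out_b1 = branch(input_data_1)
out_b2 = branch(input_data_2)
out_bf1 = Flatten()(out_b1)
out_bf2 = Flatten()(out_b2)
# Element-wise distance between the two embeddings; 1e-9 keeps sqrt stable at 0.
b_diff = Lambda(lambda vecs: tf.sqrt((vecs[0] - vecs[1]) ** 2 + 1e-9))([out_bf1, out_bf2])
dense1 = Dense(128, activation='relu')(b_diff)
output = Dense(2, activation='linear', name='Softmax_output')(dense1)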
When I use train_on_batch in Keras, it all works fine and the network trains much faster with BN layers. However, when I switch to a TensorFlow session with learning_phase(): 1 in training and learning_phase(): 0 in validation, the architecture cannot be trained and the training accuracy remains at around 50%.
Is this a known issue as well?
@monaj07 As far as I understand, you use the network for near-duplicate detection. I work on a similar problem and face the exact same issue. There seems to be a bug that affects the BatchNormalization layer when it is included in a Siamese network.
My way around the problem was to use an embedded base_model without BN layers (for example a VGG-like structure). By doing so, the results that you get are the same regardless of the learning_phase value.
@fchollet could you give us your input?
@barvinograd might be worth changing the title to "BatchNormalization does not work with autoregressive and siamese models" :)
@datumbox, my bad. I meant to write tf.group, not gather. Similarly in Theano. The calls are made by the Function class in the backends, and they are called from _make_train_function in the Model class. The group for each framework will only use one update per variable. When the same batch norm layer is applied multiple times in the same graph, all but one update will be ignored. Either the first or the last update will be considered, depending on the framework (I did not check with CNTK).
Moreover, the tensorflow backend's implementation of the moving average will create more bugs in the context of multiple applications in the same graph, at least as of when I last checked it, when the issue was posted. Just using the simple x_1 * (1-a) + x_2 * a will produce the behavior I described above.
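The "one update per variable" behavior described above can be illustrated with a toy model in plain Python. This is only a sketch of the reported symptom, not the actual Keras, TensorFlow, or Theano code: the helper names (moving_average, apply_grouped_updates) and the dictionary-based state are hypothetical, and the momentum convention (weight on the old value) is an assumption.

```python
# Toy model of the reported symptom; NOT the real backend implementation.
# Each update is a (variable_name, new_value) pair, and the "group" keeps
# only one update per variable.

def moving_average(old, new, momentum):
    """Simple moving average: momentum weights the old value (assumed convention)."""
    return old * momentum + new * (1.0 - momentum)

def apply_grouped_updates(state, updates, keep="last"):
    """Apply at most one update per variable, keeping the first or the last."""
    chosen = {}
    for name, value in updates:
        if keep == "last" or name not in chosen:
            chosen[name] = value
    new_state = dict(state)
    new_state.update(chosen)
    return new_state

state = {"moving_mean": 0.0}
momentum = 0.5
batch_means = [2.0, 4.0]  # the shared BN layer is applied to two inputs

# Both updates are computed from the same old value of moving_mean.
updates = [
    ("moving_mean", moving_average(state["moving_mean"], m, momentum))
    for m in batch_means
]

last_wins = apply_grouped_updates(state, updates, keep="last")
first_wins = apply_grouped_updates(state, updates, keep="first")
print(last_wins)   # {'moving_mean': 2.0}
print(first_wins)  # {'moving_mean': 1.0}
```

In both cases one of the two batch means never reaches the moving average, which matches the description of all but one update being ignored.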
@datumbox, @monaj07 I don't know what your use case (i.e. data) for the Siamese network is, but if you assign inputs to the branches randomly, regardless of the data, and use a naive implementation like the one in the theano backend, you might still be able to use batchnorm effectively - just keep in mind that the moving averages will be collected from just one branch. If you have different domains (e.g. photos and drawings) running on each branch, you will experience the same problem as I did.
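To make the different-domains point concrete, here is a small sketch in plain Python. The activation values are invented purely for illustration, and the normalize helper is a hypothetical stand-in for inference-time batch normalization without the learned scale and shift.

```python
import statistics

# Hypothetical activations for the two branches (values invented for illustration).
photos = [10.0, 12.0, 11.0, 9.0]         # domain seen by branch 1
drawings = [100.0, 120.0, 110.0, 90.0]   # domain seen by branch 2

def normalize(values, mean, var, eps=1e-3):
    """Inference-style normalization using fixed (moving) statistics."""
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# Assume only branch 1's update survives, so the moving statistics
# describe the photos and nothing else.
moving_mean = statistics.fmean(photos)
moving_var = statistics.pvariance(photos)

photos_norm = normalize(photos, moving_mean, moving_var)
drawings_norm = normalize(drawings, moving_mean, moving_var)

print(statistics.fmean(photos_norm))    # close to 0
print(statistics.fmean(drawings_norm))  # far from 0
```

The branch whose statistics were collected is normalized as expected, while the other domain ends up far from zero mean, which is one way the prediction-time results can diverge from training.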
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
This is still a problem
@barvinograd, did you ever resolve this issue? I was just about to implement what you described.
When batchnorm is used inside a model that is invoked more than once (e.g. inside an autoregression step), it produces very different results during prediction (non-train time). I have isolated a toy problem, and the suspected reason is that the updated values are accumulated across all invocations and then applied using K.moving_average_update. In the example below we use momentum 0.5 on the means 2, 4, 8, so we should get either
5.25 = ((0.5 * 0 + 0.5 * 2) * 0.5 + 0.5 * 4) * 0.5 + 0.5 * 8
or
2.33 = 0.5 * 0 + 0.5 * (2 + 4 + 8) / 3
but with tensorflow we get
7 = 0.5 * 0 + 0.5 * (2 + 4 + 8)
Also, if we use momentum=0.1, for example, the output is 12.666, close to 12.6 = 0.1 * 0 + 0.9 * (2 + 4 + 8).
With theano, only the first invocation is considered
with output being
for theano:
for tensorflow
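The momentum arithmetic from the toy problem can be checked with a few lines of plain Python. This is a sketch of the three candidate update rules only; the function names are hypothetical and nothing here calls Keras or either backend.

```python
def sequential_updates(initial, batch_means, momentum):
    """Apply one moving-average update per invocation, in order."""
    value = initial
    for mean in batch_means:
        value = momentum * value + (1.0 - momentum) * mean
    return value

def single_update_with_average(initial, batch_means, momentum):
    """Apply one update using the average of the per-invocation means."""
    average = sum(batch_means) / len(batch_means)
    return momentum * initial + (1.0 - momentum) * average

def single_update_with_sum(initial, batch_means, momentum):
    """Apply one update using the SUM of the means (the accumulation symptom)."""
    return momentum * initial + (1.0 - momentum) * sum(batch_means)

means = [2.0, 4.0, 8.0]

print(sequential_updates(0.0, means, 0.5))          # 5.25
print(single_update_with_average(0.0, means, 0.5))  # about 2.33
print(single_update_with_sum(0.0, means, 0.5))      # 7.0
print(single_update_with_sum(0.0, means, 0.1))      # about 12.6
```

Only the summed variant reproduces the value of 7 reported for tensorflow with momentum 0.5, which is consistent with the suspicion that the per-invocation values are accumulated before the moving average is applied.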