keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Bidirectional Wrapper: regularization is not applied to reverse network #7514

Closed · drspiffy closed this 6 years ago

drspiffy commented 7 years ago

The regularization specified via the bias_regularizer, kernel_regularizer, and recurrent_regularizer parameters of a GRU wrapped by the Bidirectional wrapper appears not to be applied to the reverse (backward) layer. Here is my definition of such a layer:

model.add(Bidirectional(
    GRU(hidden_size,
        kernel_initializer='glorot_uniform',
        recurrent_initializer='orthogonal',
        return_sequences=True,
        bias_regularizer=l2(l=B_REG),
        kernel_regularizer=l2(l=W_REG),
        recurrent_regularizer=l2(l=W_REG),
        dropout=DROPOUT,
        recurrent_dropout=DROPOUT,
        implementation=2,
        unroll=False),
    merge_mode='concat',
    input_shape=(None, input_size)))

Below is a plot of the distribution of weights (the vertical line at each point extends from -1σ to +1σ) as a function of epoch, for a training run where the recurrent_regularizer was zero but the bias_regularizer and kernel_regularizer were non-zero. The effect of regularization can clearly be seen in the input weights and biases of the forward layer, but not of the reverse layer.

[Figure: per-epoch weight distributions (±1σ) for the forward and reverse GRU layers]
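To make this easier to reproduce, here is a minimal, self-contained variant together with a way to count the regularization losses the wrapper actually collects (a sketch only: the unit count, penalty values, and input size are illustrative, and the expected counts depend on the Keras version):

    from keras.models import Sequential
    from keras.layers import Bidirectional, GRU
    from keras.regularizers import l2

    model = Sequential()
    model.add(Bidirectional(
        GRU(128, return_sequences=True,
            kernel_regularizer=l2(2e-5),
            recurrent_regularizer=l2(2e-5),
            bias_regularizer=l2(1e-4)),
        merge_mode='concat', input_shape=(None, 64)))

    bidi = model.layers[0]
    # Each built GRU should hold three penalty terms: kernel, recurrent and bias.
    print(len(bidi.forward_layer.losses))
    print(len(bidi.backward_layer.losses))
    # If the wrapper collects losses from both directions this should be 6;
    # if it drops the backward layer's penalties it will only be 3.
    print(len(model.losses))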

drspiffy commented 7 years ago

It seems that the configurations of forward_gru_1 and backward_gru_1 are both correct and both specify regularization. For example:

{'name': 'forward_gru_1', 'trainable': True, 'return_sequences': True, 'return_state': False, 'go_backwards': False, 'stateful': False, 'unroll': False, 'implementation': 2, 'units': 128, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'recurrent_initializer': {'class_name': 'Orthogonal', 'config': {'gain': 1.0, 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'recurrent_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'bias_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 9.999999747378752e-05}}, 'activity_regularizer': None, 'kernel_constraint': None, 'recurrent_constraint': None, 'bias_constraint': None, 'dropout': 0.1, 'recurrent_dropout': 0.1}

{'name': 'backward_gru_1', 'trainable': True, 'return_sequences': True, 'return_state': False, 'go_backwards': True, 'stateful': False, 'unroll': False, 'implementation': 2, 'units': 128, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'recurrent_initializer': {'class_name': 'Orthogonal', 'config': {'gain': 1.0, 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'recurrent_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'bias_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 9.999999747378752e-05}}, 'activity_regularizer': None, 'kernel_constraint': None, 'recurrent_constraint': None, 'bias_constraint': None, 'dropout': 0.1, 'recurrent_dropout': 0.1}
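For reference, such per-direction configs can be dumped from the wrapper's sub-layers; roughly like this (a sketch, reusing the toy model above):

    bidi = model.layers[0]
    print(bidi.forward_layer.get_config())
    print(bidi.backward_layer.get_config())

Both configs carry the L1L2 regularizer entries, which suggests the problem is not in the configuration itself.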

drspiffy commented 7 years ago

Additional note: the above was run with the fix for issues #5820 / #5939 applied.

drspiffy commented 7 years ago

https://gist.github.com/drspiffy/27be4f317058ec22fa113f770f5e313e

A gist that should demonstrate this problem during training.

PetraZ commented 6 years ago

I found the same problem after training a bidirectional GRU. The squared sum of the forward layer's weights is about 1000 times smaller than that of the backward layer's weights, which effectively means no regularization is applied to the backward layer. However, I had a look at the source code and could not find anything wrong there... This bug does exist!
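A rough way to reproduce that comparison (a sketch; assumes the toy model above after some training, with numpy available):

    import numpy as np

    bidi = model.layers[0]
    fwd_sq = sum(float(np.sum(np.square(w))) for w in bidi.forward_layer.get_weights())
    bwd_sq = sum(float(np.sum(np.square(w))) for w in bidi.backward_layer.get_weights())
    # With the bug present, the unregularized backward weights end up with a
    # much larger squared sum than the regularized forward weights.
    print(fwd_sq, bwd_sq)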

toosyou commented 6 years ago

I found a similar problem. I was using a bidirectional LSTM layer with both kernel_regularizer and recurrent_regularizer, but the weight histograms of the forward and backward layers behaved differently.

[Screenshot: weight histograms of the forward and backward layers]

Does anyone have any idea?

reidjohnson commented 6 years ago

Just a note that this issue can be closed. Pull request #10012 fixes the bug and has been merged.
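Once on a release that includes the fix, a quick sanity check along the lines of the earlier sketch should show the wrapper collecting penalties from both directions (again illustrative; exact counts depend on the model):

    bidi = model.layers[0]
    # The model-level losses should now include both sub-layers' penalties.
    assert len(model.losses) == len(bidi.forward_layer.losses) + len(bidi.backward_layer.losses)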

drspiffy commented 6 years ago

Thanks for the reminder