keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Bidirectional Wrapper: regularization is not applied to reverse network #7514

Closed · drspiffy closed this 6 years ago

drspiffy commented 7 years ago

The regularization specified via the bias_regularizer, kernel_regularizer, and recurrent_regularizer parameters of a GRU wrapped by the Bidirectional wrapper appears not to be applied to the reverse (backward) layer. Here is my definition of such a layer:

model.add(Bidirectional(
    GRU(hidden_size,
        kernel_initializer='glorot_uniform',
        recurrent_initializer='orthogonal',
        return_sequences=True,
        bias_regularizer=l2(l=B_REG),
        kernel_regularizer=l2(l=W_REG),
        recurrent_regularizer=l2(l=W_REG),
        dropout=DROPOUT,
        recurrent_dropout=DROPOUT,
        implementation=2,
        unroll=False),
    merge_mode='concat',
    input_shape=(None, input_size)))

Below is a plot of the distribution of weights (the vertical line at each point extends from -1σ to +1σ) as a function of epoch, for a training run where the recurrent_regularizer was zero but the bias_regularizer and kernel_regularizer were non-zero. The effect of regularization can clearly be seen in the input weights and biases of the forward layer, but not of the reverse layer.

[Figure: per-epoch weight distributions (±1σ) for the forward and reverse GRU layers]
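To make this easier to reproduce, here is a minimal, self-contained variant together with a way to count the regularization losses the wrapper actually collects (a sketch only: the unit count, penalty values, and input size are illustrative, and the expected counts depend on the Keras version):

    from keras.models import Sequential
    from keras.layers import Bidirectional, GRU
    from keras.regularizers import l2

    model = Sequential()
    model.add(Bidirectional(
        GRU(128, return_sequences=True,
            kernel_regularizer=l2(2e-5),
            recurrent_regularizer=l2(2e-5),
            bias_regularizer=l2(1e-4)),
        merge_mode='concat', input_shape=(None, 64)))

    bidi = model.layers[0]
    # Each built GRU should hold three penalty terms: kernel, recurrent and bias.
    print(len(bidi.forward_layer.losses))
    print(len(bidi.backward_layer.losses))
    # If the wrapper collects losses from both directions this should be 6;
    # if it drops the backward layer's penalties it will only be 3.
    print(len(model.losses))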

drspiffy commented 7 years ago

It seems that the configurations of forward_gru_1 and backward_gru_1 are both correct and both specify regularization. For example:

{'name': 'forward_gru_1', 'trainable': True, 'return_sequences': True, 'return_state': False, 'go_backwards': False, 'stateful': False, 'unroll': False, 'implementation': 2, 'units': 128, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'recurrent_initializer': {'class_name': 'Orthogonal', 'config': {'gain': 1.0, 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'recurrent_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'bias_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 9.999999747378752e-05}}, 'activity_regularizer': None, 'kernel_constraint': None, 'recurrent_constraint': None, 'bias_constraint': None, 'dropout': 0.1, 'recurrent_dropout': 0.1}

{'name': 'backward_gru_1', 'trainable': True, 'return_sequences': True, 'return_state': False, 'go_backwards': True, 'stateful': False, 'unroll': False, 'implementation': 2, 'units': 128, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'recurrent_initializer': {'class_name': 'Orthogonal', 'config': {'gain': 1.0, 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'recurrent_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 1.9999999494757503e-05}}, 'bias_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 9.999999747378752e-05}}, 'activity_regularizer': None, 'kernel_constraint': None, 'recurrent_constraint': None, 'bias_constraint': None, 'dropout': 0.1, 'recurrent_dropout': 0.1}
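For reference, such per-direction configs can be dumped from the wrapper's sub-layers; roughly like this (a sketch, reusing the toy model above):

    bidi = model.layers[0]
    print(bidi.forward_layer.get_config())
    print(bidi.backward_layer.get_config())

Both configs carry the L1L2 regularizer entries, which suggests the problem is not in the configuration itself.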

drspiffy commented 7 years ago

Additional note: the above was run with the fix for issues #5820 / #5939 applied.

drspiffy commented 7 years ago

https://gist.github.com/drspiffy/27be4f317058ec22fa113f770f5e313e

A gist that should demonstrate this problem during training.

PetraZ commented 6 years ago

I found the same problem after training a bidirectional GRU. The squared sum of the forward layer's weights is about 1000 times smaller than that of the backward layer's weights, which effectively means no regularization is applied to the backward layer. However, I had a look at the source code and could not find anything wrong there... This bug does exist!
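A rough way to reproduce that comparison (a sketch; assumes the toy model above after some training, with numpy available):

    import numpy as np

    bidi = model.layers[0]
    fwd_sq = sum(float(np.sum(np.square(w))) for w in bidi.forward_layer.get_weights())
    bwd_sq = sum(float(np.sum(np.square(w))) for w in bidi.backward_layer.get_weights())
    # With the bug present, the unregularized backward weights end up with a
    # much larger squared sum than the regularized forward weights.
    print(fwd_sq, bwd_sq)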

toosyou commented 6 years ago

I found a similar problem. I was using a bidirectional LSTM layer with both kernel_regularizer and recurrent_regularizer, but the weight histograms of the forward and backward layers behaved differently.

[Screenshot: weight histograms of the forward and backward layers]

Does anyone have any idea?

reidjohnson commented 6 years ago

Just a note that this issue can be closed. Pull request #10012 fixes the bug and has been merged.
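Once on a release that includes the fix, a quick sanity check along the lines of the earlier sketch should show the wrapper collecting penalties from both directions (again illustrative; exact counts depend on the model):

    bidi = model.layers[0]
    # The model-level losses should now include both sub-layers' penalties.
    assert len(model.losses) == len(bidi.forward_layer.losses) + len(bidi.backward_layer.losses)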

drspiffy commented 6 years ago

Thanks for the reminder