nokados opened this issue 1 month ago
Hi @nokados -
Here are the points you mentioned:

- With sequence length 100: Here the model is simple and the input sequence length is larger. Adding a dense layer with a tanh activation after the GRU to increase model complexity will reduce the loss.
- With sequence length 145: Here, along with the additional dense layer, increasing the units of the GRU and dense layers will give good results in terms of accuracy and loss.
- With sequence length 200: Here we need to use the Adam optimizer with an explicit learning_rate and recurrent_dropout=0.2; with the same units and layers used for sequence length 145, this gives proper training without errors. Since model complexity is increased with the GRU layer, rmsprop no longer adapts well.

A high recurrent_dropout leads to underfitting, so reducing it lets the model learn the patterns in the input data more easily. A sketch of these changes follows below.

The gist shows all the changes mentioned for the different sequence lengths. Let me know if anything more is required!
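A rough sketch of the kind of model changes described above (the input shape, unit counts, and learning rate are illustrative assumptions, not taken from the gist):

```python
import keras
from keras import layers

seq_len = 200  # also tried with 100 and 145
model = keras.Sequential([
    layers.Input(shape=(seq_len, 32)),       # 32 input features is an assumption
    layers.GRU(128, recurrent_dropout=0.2),  # reduced recurrent_dropout, more units
    layers.Dense(64, activation="tanh"),     # extra dense layer with tanh activation
    layers.Dense(1),
])
# Adam instead of rmsprop for the longer sequences
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```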
This seems more like a workaround than a solution to the original problem. Adding an extra layer with a tanh activation doesn't address the issue; it merely "hides" the enormous outputs from the GRU by squashing them into the range of -1 to 1. However, the problem is that these values should already be in this range after the GRU, since tanh is already built into it. Mathematically, it shouldn't produce values like -2.5e25, and the same behavior is expected from Keras as well.
@nokados ,
> The output of the GRU layer during training, with the default `tanh` activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.
The `tanh` activation is not applied to the actual output of the GRU, it's applied to intermediate calculations. The output of the GRU can be outside of [-1, 1]; there's nothing that prevents that.
> If you create a simple model with a GRU layer and set `recurrent_dropout=0.5`, very strange behavior occurs:
> With sequence length 20: Everything works as expected.
> With sequence length 100: The output of the GRU layer during training [...] produces very large values in the range of ±1e25
What happens is that the `recurrent_dropout` is applied on the intermediate state for each item in the sequence. So with a sequence length of 100, the `recurrent_dropout` of 0.5 is applied a hundred times. Almost all the state gets dropped, to the point that the math becomes meaningless and the model cannot learn.

To avoid this, you have to adapt the `recurrent_dropout` to the sequence length. A `recurrent_dropout` of 0.5 may be OK for a sequence length of 20, but as you experimented, with a sequence length of 100, a `recurrent_dropout` of 0.1 is probably better adapted.
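As a back-of-the-envelope illustration of this compounding effect (treating the dropout mask as independently re-drawn at every timestep, which is a simplification):

```python
# Rough intuition: the expected fraction of recurrent state that survives
# every step's mask after T steps is (1 - p) ** T.
for p in (0.5, 0.1):
    for T in (20, 100):
        print(f"recurrent_dropout={p}, seq_len={T}: kept fraction ≈ {(1 - p) ** T:.2e}")

# recurrent_dropout=0.5, seq_len=20:  kept fraction ≈ 9.54e-07
# recurrent_dropout=0.5, seq_len=100: kept fraction ≈ 7.89e-31
# recurrent_dropout=0.1, seq_len=20:  kept fraction ≈ 1.22e-01
# recurrent_dropout=0.1, seq_len=100: kept fraction ≈ 2.66e-05
```

By this rough measure, a `recurrent_dropout` of 0.5 over 100 steps leaves essentially nothing of the recurrent state, while smaller values retain a meaningful fraction.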
1) It worked until Keras 3.
2) It works well with LSTM.
3) Let's look at the GRU math:
Update gate:
$$ z_t = \sigma(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z) $$
From 0 to 1.
Reset gate:
$$ r_t = \sigma(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r) $$
From 0 to 1.
Candidate hidden state:
$$ \tilde{h}_t = \tanh(W_h \cdot x_t + U_h \cdot (r_t \odot h_{t-1}) + b_h) $$
From -1 to 1.
New hidden state:
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$
Correct?
At each recurrent step the maximum difference between $h_{t-1}$ and $h_t$ is 1 in absolute value. So after 100 steps, $h_t$ should be at most $100 \pm h_0$, where $h_0$ is 0. In practice, the values stay in the [-0.1, 0.1] range without recurrent dropout.
This behavior remains the same for the model before fitting, so we can ignore the trainable weights for now.
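For illustration, a small NumPy sketch of the recurrence above with random, untrained weights (dimensions and weight scale are arbitrary assumptions, biases set to zero): since $h_t$ is a convex combination of $h_{t-1}$ and $\tilde{h}_t \in (-1, 1)$, the hidden state starting from $h_0 = 0$ can never leave $[-1, 1]$, let alone reach ±1e25.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 100

# Random (untrained) weights with a small scale, biases omitted.
Wz, Wr, Wh = (rng.normal(0, 0.1, (d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(0, 0.1, (d_h, d_h)) for _ in range(3))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

h = np.zeros(d_h)
for t in range(T):
    x = rng.normal(size=d_in)
    z = sigmoid(Wz @ x + Uz @ h)             # update gate, in (0, 1)
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate, in (0, 1)
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state, in (-1, 1)
    h = (1 - z) * h + z * h_cand             # convex combination: |h| stays below max(|h_prev|, 1)

print(np.abs(h).max())  # stays below 1 after 100 steps
```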
How is dropout applied? I am not sure, but I guess it happens under the tanh in the $\tilde{h}_t$ calculation, so it shouldn't impact the output limits.
What about the exceptions? Is it okay that too large a `recurrent_dropout` causes `indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2]`? Could this be a memory issue?
Keras version: 3.5.0
Backend: TensorFlow 2.17.0
I encountered a strange bug when working with the GRU layer. If you create a simple model with a GRU layer and set `recurrent_dropout=0.5`, very strange behavior occurs:

- With sequence length 20: Everything works as expected.
- With sequence length 100: The output of the GRU layer during training, with the default `tanh` activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.

I was unable to reproduce this behavior in Colab; there, either the loss becomes `inf`, or it behaves similarly to the longer sequence lengths.

Key points:

- The issue occurs with `recurrent_dropout=0.5`. It works fine with smaller `recurrent_dropout` values, such as 0.1.

Irrelevant factors:

- The optimizer: the issue occurs with `rmsprop`; `adam` did not throw errors but resulted in `loss = nan`.
- `dropout` does not affect the issue.

I have prepared a minimal reproducible example in Colab. Here is the link: https://colab.research.google.com/drive/1msGuYB5E_eg_IIU_YK4cJcWrkEm3o0NL?usp=sharing.