
GRU + Large `recurrent_dropout` Bug #20276

Open nokados opened 1 month ago

nokados commented 1 month ago

Keras version: 3.5.0
Backend: TensorFlow 2.17.0

I encountered a strange bug when working with the GRU layer. If you create a simple model with a GRU layer and set recurrent_dropout=0.5, very strange behavior occurs:

  1. With sequence length 20: Everything works as expected.
  2. With sequence length 100: The output of the GRU layer during training, with the default tanh activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.
  3. With sequence length 145: The behavior is unstable. I received the following warning:
   Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f6311231eb0>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/rnn.py", line 419, in <genexpr>
    ta.write(ta_index_to_write, out)
  File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/tensorflow/python/util/tf_should_use.py", line 288, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs),

I was unable to reproduce this behavior in Colab; there, either the loss becomes inf, or it behaves similarly to the longer sequence lengths.

  4. With sequence length 200: It throws an error:
Epoch 1/50

2024-09-21 22:10:35.493005: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: indices[0] = 2648522 is not in [0, 25601)

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
Cell In[15], line 1
----> 1 model.fit(
      2     dataset, onehot_target,
      3     batch_size=128,
      4     epochs=50,     
     5 )

File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119     filtered_tb = _process_traceback_frames(e.__traceback__)
    120     # To get the full stack trace, call:
    121     # `keras.config.disable_traceback_filtering()`
--> 122     raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/sparse.py:136, in indexed_slices_union_indices_and_values.<locals>.values_for_union(indices_expanded, indices_count, values)
    132 to_union_indices = tf.gather(indices_indices, union_indices)
    133 values_with_leading_zeros = tf.concat(
    134     [tf.zeros((1,) + values.shape[1:], values.dtype), values], axis=0
    135 )
--> 136 return tf.gather(values_with_leading_zeros, to_union_indices)

InvalidArgumentError: {{function_node __wrapped__GatherV2_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2] name:

Key points:

Irrelevant factors:

I have prepared a minimal reproducible example in Colab. Here is the link: https://colab.research.google.com/drive/1msGuYB5E_eg_IIU_YK4cJcWrkEm3o0NL?usp=sharing.
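
For reference, the setup looks roughly like the sketch below; the vocabulary size, shapes, and random data are assumptions for illustration, not the exact notebook code:

```python
# Rough sketch of the reported setup (vocabulary size, shapes and data are
# assumptions; the exact code is in the linked Colab notebook).
import numpy as np
import keras

seq_len = 100        # 20 trains fine; 100 blows up; 200 raises InvalidArgumentError
vocab_size = 20000   # assumed
num_classes = 5      # assumed

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.GRU(64, recurrent_dropout=0.5),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

x = np.random.randint(0, vocab_size, size=(512, seq_len))
y = keras.utils.to_categorical(np.random.randint(0, num_classes, size=(512,)), num_classes)
model.fit(x, y, batch_size=128, epochs=2)
```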

mehtamansi29 commented 1 month ago

Hi @nokados -

Here are my responses to the points you mentioned:

  1. With sequence length 100: the model is simple while the input sequence length is large. Adding a Dense layer with a tanh activation after the GRU, to increase model complexity, will reduce the loss.

  2. With sequence length 145: here, in addition to the extra Dense layer, you need to increase the number of units in the GRU and Dense layers; this gives good results in terms of accuracy and loss.

  3. With sequence length 200: here you need to use the Adam optimizer with an explicit learning_rate and `recurrent_dropout=0.2`, keeping the same units and layers used for sequence length 145; this trains properly without the error. Since model complexity is increased through the GRU layer, rmsprop is not adaptive enough.

A large `recurrent_dropout` leads to underfitting, so reducing it lets the model learn the patterns in the input data more easily (roughly as sketched below).
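
A sketch of the kind of changes being suggested (the units, learning rate, and input pipeline are assumptions, not taken from the gist):

```python
# Sketch of the suggested changes; exact units, learning rate and the input
# pipeline are assumptions, for illustration only.
import keras

vocab_size, num_classes = 20000, 5  # assumed

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.GRU(128, recurrent_dropout=0.2),           # lower recurrent_dropout, more units
    keras.layers.Dense(128, activation="tanh"),             # extra Dense with tanh after the GRU
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),    # Adam instead of rmsprop
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```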

The gist shows all the changes mentioned for the different sequence lengths. Let me know if anything more is required!

nokados commented 1 month ago

This seems more like a workaround than a solution to the original problem. Adding an extra layer with a tanh activation doesn't address the issue; it merely "hides" the enormous outputs from the GRU by squashing them back into the range of -1 to 1. The problem is that these values should already be in this range after the GRU, since tanh is already built into it. Mathematically, it shouldn't produce values like -2.5e25, and the same behavior is expected from Keras.

hertschuh commented 1 month ago

@nokados ,

The output of the GRU layer during training, with the default tanh activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.

The tanh activation is not applied to the actual output of the GRU; it's applied to intermediate calculations. The output of the GRU can be outside of [-1, 1]; nothing prevents that.

If you create a simple model with a GRU layer and set recurrent_dropout=0.5, very strange behavior occurs:

With sequence length 20: Everything works as expected. With sequence length 100: The output of the GRU layer during training [...] produces very large values in the range of ±1e25

What happens is that the recurrent_dropout is applied to the intermediate state for each item in the sequence. So with a sequence length of 100, the recurrent_dropout of 0.5 is applied a hundred times. Almost all of the state gets dropped, to the point that the math becomes meaningless and the model cannot learn.

To avoid this, you have to adapt the recurrent_dropout to the sequence length. A recurrent_dropout of 0.5 may be fine for a sequence length of 20, but as you observed, with a sequence length of 100 a recurrent_dropout of 0.1 is probably more appropriate.
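
One way to see the effect is to call the layer directly with training=True (so the recurrent dropout is active) and compare output magnitudes across rates. A small sketch with assumed shapes; it doesn't train the layer, so it may not reproduce the full blow-up seen during fit:

```python
# Probe the effect of recurrent_dropout in training mode (shapes are assumed).
import numpy as np
import keras

x = np.random.normal(size=(8, 100, 32)).astype("float32")  # (batch, seq_len, features)

for rate in (0.0, 0.1, 0.5):
    gru = keras.layers.GRU(64, recurrent_dropout=rate)
    out = gru(x, training=True)  # training=True activates recurrent_dropout
    print(f"recurrent_dropout={rate}: max |output| =",
          float(keras.ops.max(keras.ops.abs(out))))
```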

nokados commented 1 month ago

A. Improper behavior:

1) It worked until Keras 3.
2) It works well with LSTM.
3) Let's look at the GRU math:

Update gate:

$$ z_t = \sigma(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z) $$

From 0 to 1.

Reset gate:

$$ r_t = \sigma(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r) $$

From 0 to 1.

Candidate hidden state:

$$ \tilde{h}_t = \tanh(W_h \cdot x_t + U_h \cdot (r_t \odot h_{t-1}) + b_h) $$

From -1 to 1.

New hidden state:

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$

Correct?

At each recurrent step the maximum difference between the $h_{t-1}$ and $h_t$ values is 1. So after 100 steps, $|h_t|$ should be at most $100 + |h_0|$, and $h_0$ is 0. In practice, the values stay in the $[-0.1, 0.1]$ range without recurrent dropout.

This behavior remains the same for the model before fitting, so we can ignore the trainable weights for now.
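
To sanity-check this, here is a tiny NumPy sketch of the equations above with random weights and $h_0 = 0$ (not the Keras implementation, just the math as written). Since $h_t$ is a convex combination of $h_{t-1}$ and a value in $(-1, 1)$, it never leaves $[-1, 1]$:

```python
# GRU recurrence from the equations above, with random weights and h_0 = 0.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

units, features, steps = 64, 32, 100
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(features, units)) for _ in range(3))
U_z, U_r, U_h = (rng.normal(scale=0.1, size=(units, units)) for _ in range(3))
b_z = b_r = b_h = np.zeros(units)

h = np.zeros(units)
for x_t in rng.normal(size=(steps, features)):
    z = sigmoid(x_t @ W_z + h @ U_z + b_z)              # update gate, in (0, 1)
    r = sigmoid(x_t @ W_r + h @ U_r + b_r)              # reset gate, in (0, 1)
    h_cand = np.tanh(x_t @ W_h + (r * h) @ U_h + b_h)   # candidate state, in (-1, 1)
    h = (1 - z) * h + z * h_cand                        # convex combination
print(np.abs(h).max())  # stays well inside [-1, 1]
```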

Also, look at the relationship with `recurrent_dropout`:

How is the dropout applied? I am not sure, but I guess it happens under the tanh in the $\tilde{h}_t$ calculation, so it shouldn't affect the output limits.

B. Other problems

What about the exceptions? Is it okay that too large a `recurrent_dropout` causes `indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2]`? Could this be a memory issue?