I'm assuming that since the authors remove the previous hidden state dependencies from the gates, which simplifies the model and reduces the number of parameters, this somehow helps with the vanishing gradients.
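Roughly, I picture the simplified (minGRU-style) update like this -- just a sketch of my understanding of the paper, with made-up weight names, not the repo's actual parameters:

```python
import torch

def min_gru_sequential(x, W_z, W_h):
    # gates and candidates depend only on x_t, so they can be computed
    # for all timesteps at once -- no U_z @ h_{t-1} term as in a standard GRU
    z = torch.sigmoid(x @ W_z)            # update gate, (batch, seq, dim)
    h_tilde = x @ W_h                     # candidate state, note: no tanh
    h = torch.zeros_like(h_tilde[:, 0])   # assume h_0 = 0
    outs = []
    for t in range(x.shape[1]):
        # h_t = (1 - z_t) * h_{t-1} + z_t * h~_t  -- a linear recurrence in h
        h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
        outs.append(h)
    return torch.stack(outs, dim=1)
```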
yes exactly, and also there is no nonlinearity (usually a tanh) as the hidden state propagates along the sequence dimension, which is the reason it can be parallelized with the associative scan (here formalized in log space with Heinsen's method)
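here is a rough sketch of that log-space scan for the same linear recurrence as above -- the names and the softplus on the candidate are my own assumptions to keep everything positive so the logs are real, not exactly what the repo does:

```python
import torch
import torch.nn.functional as F

def heinsen_scan(log_a, log_b):
    # parallel solve of h_t = a_t * h_{t-1} + b_t (with h_0 = 0) in log space,
    # assuming a_t, b_t > 0 so their logs are real
    a_star = torch.cumsum(log_a, dim=1)                         # log prod_{k<=t} a_k
    log_h = a_star + torch.logcumsumexp(log_b - a_star, dim=1)  # closed-form prefix sum
    return log_h.exp()

def min_gru_parallel(x, W_z, W_h):
    # same update, h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, but no loop over time
    z = torch.sigmoid(x @ W_z)
    h_tilde = F.softplus(x @ W_h)    # kept positive purely so log() is defined (my assumption)
    log_a = torch.log1p(-z)          # log(1 - z_t)
    log_b = z.log() + h_tilde.log()  # log(z_t * h~_t)
    return heinsen_scan(log_a, log_b)

# usage: (batch, seq, dim) input, square hypothetical weights
x = torch.randn(2, 16, 8)
W_z, W_h = torch.randn(8, 8) * 0.1, torch.randn(8, 8) * 0.1
h = min_gru_parallel(x, W_z, W_h)   # (2, 16, 8)
```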
this comes with its own costs, however
if you wish to have a "real" simplified RNN, take a look at this repo
We all know RNNs have this problem. While the paper "Were RNNs All We Needed?" focuses on parallelism, does it also propose any changes to handle vanishing gradients?
Just curious.