lucidrains / minGRU-pytorch

Implementation of the proposed minGRU in Pytorch
MIT License

Does this implementation handle the vanishing gradient problem? #8

Closed ParikhKadam closed 2 weeks ago

ParikhKadam commented 2 weeks ago

We all know RNNs have this problem. While the paper "Were RNNs All We Needed?" focuses on parallelism, does it also introduce any changes that address vanishing gradients?

Just curious.

dame-cell commented 2 weeks ago

I'm assuming that since the authors remove the previous hidden state dependencies from the gates, which simplifies the model and reduces the number of parameters, this somehow helps with the vanishing gradients.
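
For illustration, here is a minimal sketch of that dependence structure (a hypothetical module, not this repo's actual code): the standard GRU gate needs h_{t-1}, while the minGRU gate is computed from x_t alone, so every gate in the sequence can be computed at once.

```python
import torch
import torch.nn as nn

class MinGRUGatesSketch(nn.Module):
    # hypothetical illustration of the dependence structure, not the repo's module
    # standard GRU:   z_t = sigmoid(W_z x_t + U_z h_{t-1})  -> must run step by step
    # minGRU (paper): z_t = sigmoid(W_z x_t)                -> no h_{t-1} in the gate
    def __init__(self, dim):
        super().__init__()
        self.to_gate = nn.Linear(dim, dim)       # gate from x_t only
        self.to_candidate = nn.Linear(dim, dim)  # candidate hidden state from x_t only

    def forward(self, x):                 # x: (batch, seq_len, dim)
        z = self.to_gate(x).sigmoid()     # all gates computed at once across the sequence
        h_tilde = self.to_candidate(x)    # all candidates computed at once as well
        return z, h_tilde
```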

lucidrains commented 2 weeks ago

> I'm assuming that since the authors remove the previous hidden state dependencies from the gates, which simplifies the model and reduces the number of parameters, this somehow helps with the vanishing gradients.

yes exactly, and also there is no nonlinearity (usually a tanh) as the hidden state propagates along the sequence dimension, which is the reason it can be parallelized with the associative scan (here formalized in log space, following Heinsen)
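
for concreteness, a minimal sketch of why that helps (my own illustration, not this repo's implementation): with gates and candidates depending only on x_t and no tanh between steps, each step is affine in h_{t-1}, so the whole sequence collapses into cumulative products and sums - this repo performs the equivalent scan in log space, following Heinsen, for numerical stability

```python
import torch

def mingru_sequential(z, h_tilde, h0):
    # reference recurrence - note there is no tanh between steps:
    #   h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
    h, outs = h0, []
    for t in range(z.shape[1]):
        h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
        outs.append(h)
    return torch.stack(outs, dim = 1)

def mingru_parallel(z, h_tilde, h0):
    # each step is affine in h_{t-1}: h_t = a_t * h_{t-1} + b_t
    # with a_t = (1 - z_t) and b_t = z_t * h~_t, so
    #   h_t = A_t * (h_0 + sum_{j<=t} b_j / A_j),  A_t = prod_{k<=t} a_k
    # this plain-space version underflows for long sequences, which is
    # exactly why the repo does the equivalent scan in log space (Heinsen)
    a, b = 1 - z, z * h_tilde
    A = a.cumprod(dim = 1)
    return A * (h0.unsqueeze(1) + (b / A).cumsum(dim = 1))

# quick check that both forms agree on a short sequence
z = torch.rand(2, 8, 16)
h_tilde = torch.randn(2, 8, 16)
h0 = torch.zeros(2, 16)
assert torch.allclose(mingru_sequential(z, h_tilde, h0), mingru_parallel(z, h_tilde, h0), atol = 1e-5)
```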

this comes with its own costs, however

if you wish to have a "real" simplified RNN, take a look at this repo