lucidrains / minGRU-pytorch

Implementation of the proposed minGRU in PyTorch
MIT License

Does this implementation handle the vanishing gradient problem? #8

Closed. ParikhKadam closed this issue 1 month ago.

ParikhKadam commented 1 month ago

We all know RNNs have this problem. While the paper "Were RNNs All We Needed?" focuses on parallelism, does it also introduce any changes to handle vanishing gradients?

Just curious.

dame-cell commented 1 month ago

I'm assuming that since the authors remove the previous hidden state dependencies from the gates, which simplifies the model and reduces the number of parameters, this somehow helps with the vanishing gradients.
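Roughly, here is a minimal sequential sketch of that formulation (my own illustration with made-up module and parameter names, not the repo's actual API): the gate and the candidate are both computed from x_t alone, so the only path from h_{t-1} to h_t is the factor (1 - z_t).

```python
import torch
import torch.nn as nn

class MinGRUSketch(nn.Module):
    # sequential form of the minGRU update, as I read the paper:
    #   z_t     = sigmoid(Linear(x_t))        # gate sees only the input
    #   h_tilde = Linear(x_t)                 # candidate sees only the input
    #   h_t     = (1 - z_t) * h_{t-1} + z_t * h_tilde
    def __init__(self, dim):
        super().__init__()
        self.to_gate = nn.Linear(dim, dim)
        self.to_candidate = nn.Linear(dim, dim)

    def forward(self, x, h=None):
        # x: (batch, seq_len, dim)
        batch, seq_len, dim = x.shape
        if h is None:
            h = x.new_zeros(batch, dim)
        gates = self.to_gate(x).sigmoid()
        candidates = self.to_candidate(x)
        hiddens = []
        for t in range(seq_len):
            z = gates[:, t]
            # the only path from h_{t-1} to h_t is the factor (1 - z_t):
            # no tanh squashing, and no gate that itself depends on h_{t-1}
            h = (1 - z) * h + z * candidates[:, t]
            hiddens.append(h)
        return torch.stack(hiddens, dim=1)
```

Because the gate does not depend on h_{t-1}, the update is linear in the hidden state, which is also what makes the parallel form below possible.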

lucidrains commented 1 month ago

> I'm assuming that since the authors remove the previous hidden state dependencies from the gates, which simplifies the model and reduces the number of parameters, this somehow helps with the vanishing gradients.

yes exactly, and also there is no nonlinearity (usually a tanh) as the hidden state propagates along the sequence dimension, which is why it can be parallelized with the associative scan (here formalized in log space with the Heinsen scan)
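as a rough sketch of why that works (my own naive dense-math illustration assuming h_0 = 0, not the repo's actual log-space Heinsen scan):

```python
import torch

def parallel_min_gru_sketch(gates, candidates):
    # the minGRU step h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t is an
    # affine recurrence h_t = a_t * h_{t-1} + b_t with
    #   a_t = 1 - z_t,  b_t = z_t * h_tilde_t
    # because no tanh wraps the hidden state, it unrolls to the closed form
    #   h_t = A_t * sum_{i <= t} b_i / A_i,  where A_t = prod_{j <= t} a_j,
    # so the whole sequence reduces to cumulative products and sums
    a = 1 - gates                       # (batch, seq_len, dim), entries in (0, 1)
    b = gates * candidates
    a_star = torch.cumprod(a, dim=1)    # A_t
    # dividing by A_t underflows on long sequences; the repo instead runs the
    # equivalent scan in log space (the Heinsen formulation)
    return a_star * torch.cumsum(b / a_star.clamp(min=1e-12), dim=1)
```

on short sequences this matches the sequential loop to numerical precision; on long ones the cumulative product underflows, which is why the repo carries out the same scan in log space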

this comes with its own costs however (the gate and candidate can no longer condition on the running hidden state, so each step is less expressive)

if you wish to have a "real" simplified RNN, take a look at this repo