I'm assuming that since the authors remove the previous hidden state dependencies from the gates, which simplifies the model and reduces the number of parameters, this somehow helps with the vanishing gradients.
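Roughly, I picture the simplified (minGRU-style) update like this -- just a sketch of my understanding of the paper, with made-up weight names, not the repo's actual parameters:

```python
import torch

def min_gru_sequential(x, W_z, W_h):
    # gates and candidates depend only on x_t, so they can be computed
    # for all timesteps at once -- no U_z @ h_{t-1} term as in a standard GRU
    z = torch.sigmoid(x @ W_z)            # update gate, (batch, seq, dim)
    h_tilde = x @ W_h                     # candidate state, note: no tanh
    h = torch.zeros_like(h_tilde[:, 0])   # assume h_0 = 0
    outs = []
    for t in range(x.shape[1]):
        # h_t = (1 - z_t) * h_{t-1} + z_t * h~_t  -- a linear recurrence in h
        h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
        outs.append(h)
    return torch.stack(outs, dim=1)
```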
yes exactly, and also there is no nonlinearity (usually a tanh) as the hidden state propagates along the sequence dimension, which is the reason it can be parallelized with the associative scan (here formalized in log space with Heinsen's method)
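here is a rough sketch of that log-space scan for the same linear recurrence as above -- the names and the softplus on the candidate are my own assumptions to keep everything positive so the logs are real, not exactly what the repo does:

```python
import torch
import torch.nn.functional as F

def heinsen_scan(log_a, log_b):
    # parallel solve of h_t = a_t * h_{t-1} + b_t (with h_0 = 0) in log space,
    # assuming a_t, b_t > 0 so their logs are real
    a_star = torch.cumsum(log_a, dim=1)                         # log prod_{k<=t} a_k
    log_h = a_star + torch.logcumsumexp(log_b - a_star, dim=1)  # closed-form prefix sum
    return log_h.exp()

def min_gru_parallel(x, W_z, W_h):
    # same update, h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, but no loop over time
    z = torch.sigmoid(x @ W_z)
    h_tilde = F.softplus(x @ W_h)    # kept positive purely so log() is defined (my assumption)
    log_a = torch.log1p(-z)          # log(1 - z_t)
    log_b = z.log() + h_tilde.log()  # log(z_t * h~_t)
    return heinsen_scan(log_a, log_b)

# usage: (batch, seq, dim) input, square hypothetical weights
x = torch.randn(2, 16, 8)
W_z, W_h = torch.randn(8, 8) * 0.1, torch.randn(8, 8) * 0.1
h = min_gru_parallel(x, W_z, W_h)   # (2, 16, 8)
```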
this comes with its own costs, however
if you wish to have a "real" simplified RNN, take a look at this repo
We all know RNNs have this problem. While the paper "Were RNNs All We Needed?" focuses on parallelism, does it also propose any changes to handle vanishing gradients?
Just curious.