Based on a proposal by @danarte in issue #6.
In addition to his suggestion, the forwarding of the hidden states has to change as well: each layer must receive only its own part of the hidden state. Since I had to change the initialization of the hidden states anyway, I suggest initializing them from a uniform distribution instead of zeros (like in this article).
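
To make the two changes concrete, here is a minimal, framework-agnostic sketch in plain Python. It is not the code from this PR: `init_hidden`, `toy_layer`, `forward`, and the `bound` of 0.1 are hypothetical names and values chosen for illustration, and nested lists stand in for what would be a tensor of shape `(num_layers, batch_size, hidden_size)` in a real model. It assumes the input size equals the hidden size so the toy layers can be stacked.

```python
import random


def init_hidden(num_layers, batch_size, hidden_size, bound=0.1, seed=None):
    """Build a [num_layers][batch_size][hidden_size] nested list.

    Entries are drawn uniformly from [-bound, bound] instead of being zero.
    The bound of 0.1 is an assumed value, not one taken from the PR.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-bound, bound) for _ in range(hidden_size)]
         for _ in range(batch_size)]
        for _ in range(num_layers)
    ]


def toy_layer(inputs, h_prev):
    """Hypothetical stand-in for one recurrent layer.

    Averages the input with the previous hidden state, element by element.
    """
    return [
        [0.5 * (x + h) for x, h in zip(x_row, h_row)]
        for x_row, h_row in zip(inputs, h_prev)
    ]


def forward(inputs, hidden, layers):
    """Run the stacked layers, giving layer i only hidden[i]."""
    new_hidden = []
    out = inputs
    for i, layer in enumerate(layers):
        # Pass only this layer's slice, not the full hidden state.
        out = layer(out, hidden[i])
        new_hidden.append(out)
    return out, new_hidden


if __name__ == "__main__":
    hidden = init_hidden(num_layers=3, batch_size=2, hidden_size=4, seed=0)
    batch = [[1.0] * 4 for _ in range(2)]
    out, hidden = forward(batch, hidden, [toy_layer] * 3)
    print(len(hidden), len(out), len(out[0]))
```

The point of the sketch is the indexing in `forward`: the full hidden state is kept in one container, but each layer only ever sees `hidden[i]`, and the per-layer outputs are collected back into a container of the same layout.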