The code implementation doesn't correspond exactly to the equations in the layer normalization paper.
I also have doubts about normalizing all the gates: for example, the forget gate will never be equal to zero due to the shift we add.
Wouldn't it be more logical to keep the gates as they are and normalize only the cell state?
https://github.com/hardmaru/supercell/blob/063b01e75e6e8af5aeb0aac5cc583948f5887dd1/supercell.py#L216
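
To make the question concrete, here is a minimal sketch of the variant I have in mind: the gates are left unnormalized and only the cell state is normalized before the output nonlinearity. This is neither the supercell code nor the paper's formulation. The function names are made up for illustration, the gate order `i, j, f, o` is an assumption, and the gain and shift are fixed at 1 and 0 instead of being learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_norm(v, gain=1.0, shift=0.0, eps=1e-5):
    """Normalize v to zero mean and unit variance, then rescale and shift.

    gain and shift are plain scalars here (assumption); in a real layer
    they would be learned per-unit parameters.
    """
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [gain * (x - mean) / math.sqrt(var + eps) + shift for x in v]


def lstm_step_cell_norm_only(pre, c_prev):
    """One LSTM step where only the cell state is normalized (hypothetical).

    pre: tuple (i, j, f, o) of gate pre-activations, each a list of length n,
         assumed to already contain the input, recurrent and bias terms.
    c_prev: previous cell state, a list of length n.
    """
    i, j, f, o = pre
    # Gates are used as they are, so a very negative f can drive the
    # forget gate arbitrarily close to zero.
    c = [
        cp * sigmoid(fk) + sigmoid(ik) * math.tanh(jk)
        for cp, fk, ik, jk in zip(c_prev, f, i, j)
    ]
    # Only the cell state is normalized, just before the output tanh.
    c_norm = layer_norm(c)
    h = [math.tanh(cn) * sigmoid(ok) for cn, ok in zip(c_norm, o)]
    return h, c


if __name__ == "__main__":
    zeros = [0.0, 0.0, 0.0]
    pre = (zeros, zeros, [-20.0, 0.0, 20.0], zeros)
    h, c = lstm_step_cell_norm_only(pre, [1.0, 1.0, 1.0])
    print("new cell state:", c)
    print("new hidden state:", h)
```

In this sketch the first unit's forget gate is effectively closed, which is the behaviour I would expect to lose if the gate pre-activations were normalized and shifted first.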
Thank you