Why add a bias to the forget gate?

https://github.com/hardmaru/supercell/blob/063b01e75e6e8af5aeb0aac5cc583948f5887dd1/supercell.py#L216

The implementation doesn't correspond exactly to the equations in the layer normalization paper. I also have doubts about normalizing all the gates: for example, the forget gate will never be equal to zero due to the shift we add. Wouldn't it be more logical to keep the gates as they are and normalize only the cell state?
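To make the concern concrete, here is a minimal NumPy sketch of what I mean. It is not the supercell code: `layer_norm`, `forget_gate`, `forget_bias`, and the sample pre-activations are names and values I made up for illustration, and I am assuming the normalization is applied to the gate pre-activation before the bias and the sigmoid.

```python
import numpy as np


def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize a vector to zero mean and unit variance, then scale and shift.
    mean = x.mean()
    var = x.var()
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forget_gate(pre_activation, forget_bias=1.0, normalize=True):
    # Hypothetical gate: optional layer norm, then a constant shift, then sigmoid.
    z = layer_norm(pre_activation) if normalize else pre_activation
    return sigmoid(z + forget_bias)


# Made-up pre-activations, including strongly negative ones
# that would ask the cell to forget almost everything.
pre = np.array([-8.0, -4.0, 0.0, 4.0])

raw = forget_gate(pre, forget_bias=0.0, normalize=False)
normed = forget_gate(pre, forget_bias=1.0, normalize=True)

print("plain gate:           ", raw)
print("normalized + shifted: ", normed)
```

With these made-up inputs, the plain gate can get very close to zero, while the normalized and shifted gate stays well away from it, which is the behaviour I am asking about.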

Thank you
