Hi, I was looking through the code and noticed something strange.
This function is supposed to implement RMSNorm from Zhang, Biao, and Rico Sennrich. "Root mean square layer normalization." Advances in Neural Information Processing Systems 32 (2019).
But instead of dividing by the appropriate coefficient, it multiplies:
https://github.com/meta-llama/llama/blob/8fac8befd776bc03242fe7bc2236cdb41b6c609c/llama/model.py#L52-L63
If the mean of the squared entries is already 1 (that is, the sum of squares equals n, so RMS(x) = 1), multiplying and dividing give the same result. For any other value, though, multiplying makes larger vectors larger and smaller vectors smaller, moving them away from unit RMS, which is the opposite of the intended normalization.
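For reference, here is a minimal sketch of what the paper's definition would look like, dividing by RMS(x) = sqrt(mean(x_i^2)). The function name, the `eps` stabilizer, and the learned gain follow common convention and are illustrative, not taken from the repo:

```python
import torch

def rmsnorm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMS over the last dimension: sqrt(mean(x_i^2)), per Zhang & Sennrich (2019).
    # eps is a small constant for numerical stability (conventional, not from the paper).
    rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    # Divide by RMS so the output has approximately unit root mean square,
    # then apply the learned per-dimension gain.
    return (x / rms) * weight
```

With this version, a vector whose RMS is above 1 gets scaled down and one below 1 gets scaled up, which is the behavior I'd expect from the paper.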