Closed HashmatShadab closed 1 year ago
Hi. First, a small clarification: since the models we use predominantly contain LayerNorm, the lines of code above exclude LayerNorm (and also BatchNorm, if present) from regularization.
Weight decay is a regularizer that encourages the weights across individual feature dimensions to stay balanced, so that no single direction dominates. Normalization layers do not have products (no explicit weighting across dimensions), so a weight decay term on them would not make sense: it would simply push the magnitude of their parameters toward 0. For a detailed analysis, see https://arxiv.org/pdf/1706.05350.pdf. Hence, in practice, many modern architectures trained for image classification do not apply weight decay to normalization layers.
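The exclusion described above can be sketched as follows. This is a minimal illustration (not the repo's actual code; the model and `weight_decay` value are placeholders): parameters belonging to normalization layers are routed into an optimizer group with `weight_decay=0.0`, while all other parameters keep the usual decay.

```python
import torch
import torch.nn as nn

def split_decay_groups(model, weight_decay=5e-4):
    """Return optimizer param groups with weight decay disabled
    for LayerNorm/BatchNorm parameters."""
    norm_types = (nn.LayerNorm, nn.BatchNorm1d, nn.BatchNorm2d)
    # Collect ids of parameters owned directly by normalization layers.
    norm_param_ids = set()
    for module in model.modules():
        if isinstance(module, norm_types):
            for p in module.parameters(recurse=False):
                norm_param_ids.add(id(p))
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (no_decay if id(p) in norm_param_ids else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Toy model purely for illustration.
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 2))
groups = split_decay_groups(model)
opt = torch.optim.SGD(groups, lr=0.1, momentum=0.9)
```

Here the LayerNorm's weight and bias end up in the zero-decay group, and the two Linear layers' weights and biases keep the regular decay.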
Hope this helps.
Thanks. It was quite helpful :)
https://github.com/nmndeep/revisiting-at/blob/932f73f248447addac17542a574ad7bc784c0cbd/main.py#L447-449