Models with running statistics (optimizer momentum, batch norm, layer norm) generally benefit from a short period of zero, or linearly ramped learning rate which gives them time to converge onto good estimates of the population statistics without having the initial values throw off the initial weight updates.
Models with running statistics (optimizer momentum, batch norm, layer norm) generally benefit from a short period of zero, or linearly ramped learning rate which gives them time to converge onto good estimates of the population statistics without having the initial values throw off the initial weight updates.