Hi! I've been trying to reproduce the GSAM results. I noticed that in the code, the learning rate (LR) warmup starts from 0, which is lower than the minimum LR used for the post-warmup decay. Because rho is scheduled proportionally to the LR, it takes negative values early in training.
This does not seem intentional, since according to the paper rho is never supposed to be negative. I'm curious whether fixing this would make any difference to the paper's results. My guess is that it only affects a very small portion of training (1/3 of the first epoch) and wouldn't change anything.
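
To illustrate what I mean, here is a minimal sketch, not the actual repository code. It assumes rho is a linear function of the LR that maps `[lr_min, lr_max]` onto `[rho_min, rho_max]`. The function names and all numeric values are hypothetical and chosen only for illustration. Clamping the LR before computing rho is just one possible fix, not necessarily the one the authors would pick.

```python
def rho_from_lr(lr, lr_min, lr_max, rho_min, rho_max):
    """Assumed schedule: rho is a linear function of the LR,
    mapping [lr_min, lr_max] onto [rho_min, rho_max]."""
    return rho_min + (rho_max - rho_min) * (lr - lr_min) / (lr_max - lr_min)


def rho_from_lr_clamped(lr, lr_min, lr_max, rho_min, rho_max):
    """One possible fix (an assumption): clamp the LR into
    [lr_min, lr_max] first, so rho stays within [rho_min, rho_max]."""
    lr = min(max(lr, lr_min), lr_max)
    return rho_from_lr(lr, lr_min, lr_max, rho_min, rho_max)


def warmup_lr(step, warmup_steps, lr_max):
    """Linear warmup from 0 up to lr_max, as described above."""
    return lr_max * step / warmup_steps


# Hypothetical values, for illustration only.
LR_MIN, LR_MAX = 1e-4, 1e-2
RHO_MIN, RHO_MAX = 0.0, 0.6
WARMUP_STEPS = 1000

for step in (0, 5, 10, 500, 1000):
    lr = warmup_lr(step, WARMUP_STEPS, LR_MAX)
    rho = rho_from_lr(lr, LR_MIN, LR_MAX, RHO_MIN, RHO_MAX)
    rho_fixed = rho_from_lr_clamped(lr, LR_MIN, LR_MAX, RHO_MIN, RHO_MAX)
    print(f"step={step:5d} lr={lr:.6f} rho={rho:+.5f} rho_clamped={rho_fixed:+.5f}")
```

With these numbers, rho is negative for every warmup step where the LR is still below `LR_MIN`, and the clamped version holds it at `RHO_MIN` instead.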
@lucasb-eyer @juntang-zhuang