juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

Inconsistent use of epsilon #61

Closed cossio closed 2 years ago

cossio commented 2 years ago

Hello, I noticed an inconsistency in the paper with the epsilon parameter. In the main-text,

[image: the main-text update rule, where ε is added into s_t at every step]

whereas in the supplementary materials:

[image: the appendix update rule, where ε appears only in the denominator]

The two are not equivalent: in the first case, eps is added into s_t at every iteration, which biases the variance estimate by roughly (number of iterations) * epsilon, if I'm not mistaken.

Is this a typo? If so, what is the "correct" version and which one is implemented in the repo? Thanks!
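To make the inequivalence concrete, here is a small sketch (my own simplification: scalar state, zero gradients to isolate the ε contribution, and the β₂/ε names from the paper) comparing the two placements of ε:

```python
import numpy as np

def adabelief_s_eps_inside(grads, beta2=0.999, eps=1e-8):
    """Main-text variant: eps is folded into s_t every step,
    so it accumulates inside the second-moment estimate."""
    s, m = 0.0, 0.0
    for g in grads:
        m = 0.9 * m + 0.1 * g
        s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps  # eps inside s_t
    return np.sqrt(s) + eps  # denominator used by the update step

def adabelief_s_eps_outside(grads, beta2=0.999, eps=1e-8):
    """Appendix variant: s_t accumulates only (g - m)^2,
    and eps appears once in the denominator."""
    s, m = 0.0, 0.0
    for g in grads:
        m = 0.9 * m + 0.1 * g
        s = beta2 * s + (1 - beta2) * (g - m) ** 2
    return np.sqrt(s) + eps

grads = np.zeros(10000)  # zero gradients: any difference comes from eps alone
print(adabelief_s_eps_inside(grads))   # grows toward sqrt(eps / (1 - beta2))
print(adabelief_s_eps_outside(grads))  # stays at eps
```

With zero gradients the first variant's denominator grows toward sqrt(ε / (1 − β₂)) ≈ 3.2e-3 while the second stays at ε = 1e-8, so the two rules really do behave differently.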

juntang-zhuang commented 2 years ago

The first one is correct, I forgot to modify it in the appendix.

cossio commented 2 years ago

Thanks. Can you comment briefly on the motivation for putting the epsilon here? Perhaps this is explained somewhere in the latest version of the paper but I don't see it.

juntang-zhuang commented 2 years ago

Hi, this is actually due to my carelessness at the very beginning: the Adam implementation uses `add_` everywhere, so I used it too, without noticing that `add_` and `add` are different. After some people pointed this out, I thought about it for a long time. My impression is that `add_` gradually makes the denominator stiff, so in the late phase of training the oscillation is not large when the accumulated epsilon is large compared to g^2.
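For readers less familiar with PyTorch: `add_` is the in-place variant and `add` returns a new tensor, which is exactly why ε ends up accumulated in one case but not the other. A minimal sketch with made-up values:

```python
import torch

eps = 1e-8

# In-place: add_ mutates s, so eps is baked into the accumulated
# statistic and compounds across optimizer steps.
s = torch.zeros(3)
for _ in range(1000):
    s.mul_(0.999).add_(eps)   # eps accumulates inside s
print(float(s[0]))            # approaches eps / (1 - 0.999) = 1e-5

# Out-of-place: add builds a fresh tensor for the denominator
# and leaves the accumulator itself unchanged.
s2 = torch.zeros(3)
denom = s2.add(eps).sqrt()
print(float(s2[0]))           # still 0.0 -- s2 was not modified
```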

However, `add_` will not keep the effective epsilon in the denominator increasing without an upper bound: since s_{t+1} = \beta s_t + (1-\beta) g_t^2 + \epsilon, if s_t and g_t^2 have limits as t \to \infty, then they satisfy s = g^2 + \epsilon / (1-\beta). So just in terms of the effective \epsilon, using `add_` is like using \epsilon in Adam at the beginning and gradually growing to \epsilon / (1-\beta) in Adam, which somehow stabilizes the training process.
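The fixed-point claim above can be checked numerically; here is a quick sketch with an assumed constant gradient and made-up β/ε values:

```python
# Iterating s <- beta*s + (1-beta)*g^2 + eps with a constant gradient g
# should converge to the fixed point s = g^2 + eps/(1-beta).
beta, eps, g = 0.999, 1e-8, 0.01

s = 0.0
for _ in range(100000):
    s = beta * s + (1 - beta) * g ** 2 + eps

print(s)                          # close to the predicted limit
print(g ** 2 + eps / (1 - beta))  # g^2 + eps/(1-beta) = 1e-4 + 1e-5
```

So the effective epsilon settles at ε/(1−β) = 1e-5 here, an order of magnitude above the nominal ε, but bounded.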

cossio commented 2 years ago

I see, thanks for the reply, that clears things up. I'll close this.