keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.

Adam (and other adaptive algorithms) use epsilon the wrong way #315

Closed pandrey-fr closed 1 year ago

pandrey-fr commented 1 year ago

System information.

Describe the problem.

The current implementation of Adam (and, in fact, of other adaptive algorithms such as Adagrad or RMSprop) makes use of the epsilon hyper-parameter in a way that is inconsistent with the literature (Adam paper, Algorithm 1), with other libraries (such as Torch and MXNET), and with the rationale behind it.

In the paper, the update rule for Adam is stated to be $\theta_t = \theta_{t-1} - \eta_t \hat{m_t} / (\sqrt{\hat{v_t}} + \epsilon)$, but the current Keras implementation instead applies $\theta_t = \theta_{t-1} - \eta_t \hat{m_t} / \sqrt{\hat{v_t} + \epsilon}$.

Edit: my statement about Adam is wrong (the problem is not with the placement of epsilon in the computation but with the handling of $\sqrt{\hat{v_t}}$ - see the discussion in the comments below); however, I stand by it for the more general point below:

More generally, for algorithms such as Adagrad or RMSprop, when the update rule is mathematically stated as $\theta_t = \theta_{t-1} - \eta_t \, \textnormal{num} / \sqrt{\textnormal{den}}$, it is common practice to add an epsilon term to the denominator in order to prevent division by zero and/or numerical under- or overflow. The convention in most frameworks, which in my humble opinion is coherent with that rationale, is then to use $\theta_t = \theta_{t-1} - \eta_t \, \textnormal{num} / (\sqrt{\textnormal{den}} + \epsilon)$, i.e. to consider the square root as incidental and to ensure that the final denominator is greater than or equal to epsilon. In keras, however, the formula is consistently implemented as $\theta_t = \theta_{t-1} - \eta_t \, \textnormal{num} / \sqrt{\textnormal{den} + \epsilon}$.

Describe the current behavior.

As stated above, the formulas make use of $\theta_t = \theta_{t-1} - \eta_t \, \textnormal{num} / \sqrt{\textnormal{den} + \epsilon}$.

Describe the expected behavior.

As stated above, the formulas should make use of $\theta_t = \theta_{t-1} - \eta_t \, \textnormal{num} / (\sqrt{\textnormal{den}} + \epsilon)$, for consistency with other reference implementations, with the literature (in the case of Adam), and with the rationale behind the use of epsilon.

Please note that one cannot simply change the value of the epsilon hyper-parameter to achieve a behaviour similar to that of other frameworks, since the equivalent epsilon would depend on the value of the base denominator, which is typically a function of the input gradients and/or of a stateful variable (momentum, velocity...).
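To illustrate this point, here is a minimal NumPy sketch of my own (not keras code): if one wanted the keras convention $\sqrt{\textnormal{den} + \epsilon'}$ to reproduce the common convention $\sqrt{\textnormal{den}} + \epsilon$, the required $\epsilon'$ would be $2\sqrt{\textnormal{den}}\,\epsilon + \epsilon^2$, which depends on the denominator itself:

```python
import numpy as np

eps = 1e-7  # a typical default epsilon value

for den in (0.0, 1e-8, 1e-4, 1.0):
    outside = np.sqrt(den) + eps   # epsilon added outside the square root
    # epsilon' such that sqrt(den + eps_prime) == sqrt(den) + eps
    eps_prime = outside**2 - den   # equals 2 * sqrt(den) * eps + eps**2
    print(f"den={den:g}  equivalent eps'={eps_prime:.3e}")
```

Running this shows that the equivalent epsilon spans several orders of magnitude depending on the denominator, so no single constant can bridge the two conventions.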

At the very least, if changing the formula is not deemed an adequate solution, I think that it could be worth documenting this point of discrepancy between keras and other frameworks, as it can have non-trivial effects on the reproducibility of experiments depending on the chosen backend framework.

Contributing.

Standalone code to reproduce the issue.

There is no code to run, but rather code to review. For Adam:

sachinprasadhs commented 1 year ago

Hi, thanks for opening the issue and for your effort to compare with other implementations. The current keras implementation is in line with the paper and with other implementations, as per the formula $\theta_t = \theta_{t-1} - \eta_t \hat{m_t} / (\sqrt{\hat{v_t}} + \epsilon)$. In the keras implementation, in the expression `variable.assign_sub((m * alpha) / (tf.sqrt(v) + self.epsilon))`, the square root is applied only to `v`; it does not include `epsilon`. Correct me if I'm wrong or if you are referring to any other document. Thanks!

pandrey-fr commented 1 year ago

Hi, Thank you for your answer.

Adam

You are right, I misidentified the problem with Adam; however, I think that there still is one:

The paper states: $\theta_t = \theta_{t-1} - \eta_t \hat{m_t} / (\sqrt{\hat{v_t}} + \epsilon)$, which can be rewritten, given the formulas for $\hat{m_t}$ and $\hat{v_t}$, as: $\theta_t = \theta_{t-1} - \eta_t \, (m_t / (1 - \beta_1^t)) / (\sqrt{v_t / (1 - \beta_2^t)} + \epsilon)$

However, in the current keras implementation we have: $\theta_t = \theta_{t-1} - \alpha \, m_t / (\sqrt{v_t} + \epsilon)$ with $\alpha = \eta_t \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$, which gives $\theta_t = \theta_{t-1} - \eta_t \hat{m_t} \sqrt{1 - \beta_2^t} / (\sqrt{v_t} + \epsilon)$. It turns out that, unless $\epsilon = 0$, $\sqrt{1 - \beta_2^t} / (\sqrt{v_t} + \epsilon) \neq 1 / (\sqrt{v_t / (1 - \beta_2^t)} + \epsilon)$, the right-hand term being the desired $1 / (\sqrt{\hat{v_t}} + \epsilon)$.

In other words, the issue is with the incorrect factorization of the bias correction of the velocity state.
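Here is a quick numerical check of that claim (my own sketch, merely comparing the two denominator factors at the first step with the default $\beta_2$):

```python
import numpy as np

eps = 1e-7
beta_2 = 0.999
t = 1
correction = 1.0 - beta_2**t  # bias-correction term (1 - beta_2^t)

for v in (0.0, 1e-8, 1e-4, 1.0):
    keras_style = np.sqrt(correction) / (np.sqrt(v) + eps)  # factorized form currently used
    paper_style = 1.0 / (np.sqrt(v / correction) + eps)     # Algorithm 1 of the Adam paper
    print(f"v={v:g}  keras={keras_style:.6e}  paper={paper_style:.6e}")
```

The two coincide only when $\epsilon = 0$ (or once $\beta_2^t$ has vanished); for small $v_t$ they differ by a large factor.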

Adagrad and RMSProp

For these algorithms (and most probably for some others), I maintain that there is an issue with the way epsilon is used: it is included under the square root of the base denominator term, which I think it should not be.

pandrey-fr commented 1 year ago

To make my conclusion about Adam more readable: compared to the paper's formula, the current implementation divides $\epsilon$ by $\sqrt{1 - \beta_2^t}$ in the update rule, which results in multiplying epsilon by a factor that is initially greater than 1 and decreases towards 1 as time goes by.
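For instance, with the default $\beta_2 = 0.999$, at the first step the effective epsilon is $\epsilon / \sqrt{1 - 0.999} \approx 31.6\,\epsilon$ (roughly $3.16 \times 10^{-6}$ with the default $\epsilon = 10^{-7}$), and it only decays back towards $\epsilon$ as $\beta_2^t$ vanishes.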

sachinprasadhs commented 1 year ago

Hi, Thanks for clarifying.

As mentioned in the documentation, the epsilon you are referring to in Algorithm 1 is not the one implemented in keras; instead it is epsilon hat, which is mentioned just before section 2.1 of the paper, as shown in the excerpt below. [screenshot of the relevant excerpt from the paper]

https://github.com/keras-team/keras/blob/e6784e4302c7b8cd116b74a784f4b78d60e83c26/keras/optimizers/optimizer_experimental/adam.py#L54-L57
https://github.com/keras-team/keras/blob/e6784e4302c7b8cd116b74a784f4b78d60e83c26/keras/optimizers/optimizer_experimental/adam.py#L67-L74

pandrey-fr commented 1 year ago

Oh, I see. Thank you for pointing that out.

While the paper does not state it explicitly, my understanding is that it is implied that $\hat{\epsilon} = \epsilon \sqrt{1 - \beta_2^t}$, i.e. that the raw epsilon should be adjusted to fit into the formula. Documenting the input epsilon as being $\hat{\epsilon}$ feels really weird to me, since by definition $\hat{\epsilon}$ is supposed to be adjusted from a constant $\epsilon$ value at each step, rather than set as a fixed hyper-parameter constant.

I would therefore have expected the code to look like

 sqrt_beta_2_power = tf.sqrt(1 - beta_2_power)
 alpha = lr * sqrt_beta_2_power / (1 - beta_1_power)
 # ...
 variable.assign_sub((m * alpha) / (tf.sqrt(v) + self.epsilon * sqrt_beta_2_power))

Asymptotically the formula becomes the same, and obviously the discrepancy between the keras implementation and the canonical formula is unlikely to hurt convergence. However, it means that switching between keras and another framework (such as torch or mxnet) to implement a similar model architecture with deterministic initial weights is likely not to yield the same results, which is not great for experiment reproducibility.

One might argue that differences in computational backends are likely to cause discrepancies due to numerical instabilities anyway, but the order of magnitude of the discrepancies due to the change in the Adam formula can be significant. Using some random normally-distributed fake gradients and running the first step of Adam with a 1.0 base learning rate, I ended up with differences in the $[10^{-6}, 10^{-3}]$ range, which is not a lot (and will shrink further when using an actual base learning rate) but does not seem entirely trivial to me.
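For reference, here is a minimal sketch along the lines of what I ran (my own code with made-up gradients; the exact numbers depend on the seed and tensor size):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=100)  # fake normally-distributed gradients

lr, beta_1, beta_2, eps, t = 1.0, 0.9, 0.999, 1e-7, 1

# First-step moment estimates (m and v start at zero).
m = (1 - beta_1) * grads
v = (1 - beta_2) * grads**2

# keras-style step: alpha = lr * sqrt(1 - beta_2^t) / (1 - beta_1^t), epsilon added to sqrt(v).
alpha = lr * np.sqrt(1 - beta_2**t) / (1 - beta_1**t)
step_keras = alpha * m / (np.sqrt(v) + eps)

# Paper-style step (Algorithm 1): bias-corrected m_hat and v_hat, epsilon added to sqrt(v_hat).
m_hat = m / (1 - beta_1**t)
v_hat = v / (1 - beta_2**t)
step_paper = lr * m_hat / (np.sqrt(v_hat) + eps)

diff = np.abs(step_keras - step_paper)
print(f"min diff={diff.min():.3e}  max diff={diff.max():.3e}")
```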

ianstenbit commented 1 year ago

Thanks for the discussion @pandrey-fr! Given the numerical magnitude of this change, as well as the fact that our docs call out that epsilon is in fact epsilon hat, let's keep the implementation as-is.

pandrey-fr commented 1 year ago

Thank you for your answer @ianstenbit, although it is quite disappointing, as it means getting the expected behavior requires hacking around the optimizer to manually update the epsilon at each step.

My bad, perhaps, for shipping two points in a single issue, but there is one point that was still left unanswered: what about the other optimizers (e.g. Adagrad and RMSProp) that use `tf.sqrt(denominator + epsilon)` rather than `tf.sqrt(denominator) + epsilon`?