jellAIfish / jellyfish

This repository is inspired by Quinn Liu's repository Walnut.

ADADELTA #25

Closed markroxor closed 6 years ago

markroxor commented 6 years ago

https://arxiv.org/pdf/1212.5701

markroxor commented 6 years ago

The benefits of this approach are as follows:
• no manual setting of a learning rate.
• insensitive to hyperparameters.
• separate dynamic learning rate per-dimension.
• minimal computation over gradient descent.
• robust to large gradients, noise and architecture choice.
• applicable in both local or distributed environments.

markroxor commented 6 years ago

There are many modifications to the gradient descent algorithm. The most powerful such modification is Newton's method, which requires second order derivatives of the cost function:

∆x_t = H_t^{−1} g_t   (3)

where H_t^{−1} is the inverse of the Hessian matrix of second derivatives computed at iteration t. This determines the optimal step size to take for quadratic problems, but unfortunately is prohibitive to compute in practice for large models. Therefore, many additional approaches have been proposed to either improve the use of first order information or to approximate the second order information.
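
As a rough illustration (not from the paper), the sketch below applies a single Newton step to a small quadratic, where the Hessian is just a constant matrix A; one linear solve lands on the minimizer, which is exactly the second-order behaviour that becomes too expensive for large models.

```python
# Sketch only: one Newton step on the quadratic f(x) = 0.5 x^T A x - b^T x.
# Here the Hessian H_t is the constant matrix A and the gradient is A x - b.
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

x = np.zeros(2)                  # current iterate x_t
g = A @ x - b                    # gradient g_t
x = x - np.linalg.solve(A, g)    # Newton step: subtract H_t^{-1} g_t

print(x, np.allclose(A @ x, b))  # minimizer reached in a single step
```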

markroxor commented 6 years ago

One method to prevent this is to slow down the parameter updates by decreasing the learning rate. This can be done manually when the validation accuracy appears to plateau. Alternatively, learning rate schedules have been proposed [1] to automatically anneal the learning rate based on how many epochs through the data have been done. These approaches typically add additional hyperparameters to control how quickly the learning rate decays.
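
For concreteness, here is a minimal sketch of one such schedule (step decay; the function name and constants are placeholders, not from the paper), showing the extra hyperparameters these schedules introduce:

```python
# Step-decay schedule sketch: eta0, decay factor and decay interval are all
# additional hyperparameters that have to be tuned by hand.
def step_decay_lr(epoch, eta0=0.1, decay=0.5, every=10):
    """Halve the base learning rate every `every` epochs."""
    return eta0 * decay ** (epoch // every)

for epoch in (0, 9, 10, 25, 40):
    print(epoch, step_decay_lr(epoch))   # 0.1, 0.1, 0.05, 0.025, 0.00625
```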

markroxor commented 6 years ago

The ADAGRAD method is sensitive to the choice of learning rate. Also, due to the continual accumulation of squared gradients in the denominator, the learning rate will continue to decrease throughout training, eventually decreasing to zero and stopping training completely.
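
A small numeric sketch (illustrative only) of that accumulation: with a constant gradient, the ADAGRAD effective learning rate η / √(Σ g²) can only shrink as training proceeds.

```python
# ADAGRAD denominator sketch: the sum of squared gradients never decreases,
# so the effective per-dimension learning rate decays monotonically.
import numpy as np

eta, eps = 1.0, 1e-6
accum = 0.0                                  # running sum of squared gradients
for t in range(1, 6):
    g = 1.0                                  # constant gradient for illustration
    accum += g ** 2
    print(f"t={t}  effective lr = {eta / (np.sqrt(accum) + eps):.3f}")
# prints 1.000, 0.707, 0.577, 0.500, 0.447 -- heading monotonically toward zero
```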

markroxor commented 6 years ago

These oscillations are mitigated when using momentum because the sign of the gradient changes, and thus the momentum term damps down these updates to slow progress across the valley.
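
As an illustration (classical momentum, not the paper's code): when the gradient flips sign every step, consecutive contributions cancel inside the velocity term and the resulting updates are much smaller than plain gradient steps.

```python
# Momentum damping sketch: v_t = rho * v_{t-1} - eta * g_t, update x by v_t.
rho, eta = 0.9, 0.1
v = 0.0
for t in range(1, 201):
    g = (-1.0) ** t            # gradient that flips sign every step
    v = rho * v - eta * g      # sign flips largely cancel inside the velocity
print(abs(v))                  # ~0.053 = eta / (1 + rho), vs. eta = 0.1 without momentum
```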

markroxor commented 6 years ago

Algorithm 1 Computing ADADELTA update at time t
Require: Decay rate ρ, Constant ε
Require: Initial parameter x_1
1: Initialize accumulation variables E[g²]_0 = 0, E[∆x²]_0 = 0
2: for t = 1 : T do  %% Loop over # of updates
3:   Compute Gradient: g_t
4:   Accumulate Gradient: E[g²]_t = ρ E[g²]_{t−1} + (1 − ρ) g_t²
5:   Compute Update: ∆x_t = − (RMS[∆x]_{t−1} / RMS[g]_t) g_t
6:   Accumulate Updates: E[∆x²]_t = ρ E[∆x²]_{t−1} + (1 − ρ) ∆x_t²
7:   Apply Update: x_{t+1} = x_t + ∆x_t
8: end for
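
A direct NumPy reading of Algorithm 1 might look like the sketch below (an interpretation, not the authors' implementation); the toy quadratic at the end is only there to exercise it.

```python
import numpy as np

def adadelta(grad_fn, x, n_steps=2000, rho=0.95, eps=1e-6):
    """Minimize a function given its gradient grad_fn, starting from x_1 = x."""
    Eg2 = np.zeros_like(x)     # E[g^2]_0 = 0
    Edx2 = np.zeros_like(x)    # E[dx^2]_0 = 0
    for _ in range(n_steps):                                 # loop over updates
        g = grad_fn(x)                                       # compute gradient g_t
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2                 # accumulate gradient
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # compute update
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2              # accumulate updates
        x = x + dx                                           # apply update
    return x

# Toy example: minimize sum((x - 3)^2), whose gradient is 2 * (x - 3).
print(adadelta(lambda x: 2 * (x - 3), np.zeros(3)))
# drifts toward the minimizer [3., 3., 3.] (slowly at first, since both
# accumulators start at zero)
```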

markroxor commented 6 years ago

Instead of accumulating the sum of squared gradients over all time, we restrict the window of past gradients that are accumulated to be some fixed size w (instead of size t, where t is the current iteration, as in ADAGRAD). With this windowed accumulation the denominator of ADAGRAD cannot accumulate to infinity and instead becomes a local estimate using recent gradients.

markroxor commented 6 years ago

Since storing w previous squared gradients is inefficient, our method implements this accumulation as an exponentially decaying average of the squared gradients. Assume at time t this running average is E[g²]_t; then we compute:

E[g²]_t = ρ E[g²]_{t−1} + (1 − ρ) g_t²   (8)

where ρ is a decay constant similar to that used in the momentum method. Since we require the square root of this quantity in the parameter updates, this effectively becomes the RMS of previous squared gradients up to time t:

RMS[g]_t = √(E[g²]_t + ε)   (9)

where a constant ε is added to better condition the denominator as in [5]. The resulting parameter update is then:

∆x_t = − (η / RMS[g]_t) g_t
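
A small sketch of Eqns. (8) and (9) with assumed toy values: after the gradient magnitude changes, the exponentially decaying average forgets the old regime, which a whole-history accumulation would not.

```python
# Decaying-average sketch: no window of w gradients is stored, yet the RMS
# tracks the recent gradient scale rather than the entire history.
import numpy as np

rho, eps = 0.9, 1e-6
grads = np.concatenate([np.full(50, 1.0), np.full(50, 0.1)])   # regime change

Eg2 = 0.0
for g in grads:
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2    # Eqn. (8): decaying average
print(np.sqrt(Eg2 + eps))                   # Eqn. (9): RMS of recent gradients
# ~0.12, close to the recent magnitude 0.1; an all-time average of g^2 would
# give sqrt(0.505) ~ 0.71 instead.
```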

markroxor commented 6 years ago

The ∆x_t for the current time step is not known, so we assume the curvature is locally smooth and approximate ∆x_t by computing the exponentially decaying RMS over a window of size w of previous ∆x to give the ADADELTA method:

∆x_t = − (RMS[∆x]_{t−1} / RMS[g]_t) g_t   (14)

where the same constant ε is added to the numerator RMS as well.

markroxor commented 6 years ago

In Eqn. 14 the RMS[∆x]_{t−1} quantity lags behind the denominator by one time step, due to the recurrence relationship for ∆x_t. An interesting side effect of this is that the system is robust to large sudden gradients, which act to increase the denominator, reducing the effective learning rate at the current time step, before the numerator can react.
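
A numeric sketch of this lag (accumulator values assumed for illustration): when a gradient spike arrives, RMS[g]_t jumps immediately while RMS[∆x]_{t−1} has not yet seen the corresponding update, so the effective learning rate drops at exactly that step.

```python
# Spike-robustness sketch: the denominator reacts to the large gradient on the
# same step, the numerator only one step later.
import numpy as np

rho, eps = 0.95, 1e-6
Eg2, Edx2 = 0.01, 0.0001        # assumed "warmed up" accumulators

for g in (0.1, 0.1, 10.0):      # the last gradient is a sudden 100x spike
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    print(f"g={g:5.1f}  effective lr = {abs(dx / g):.4f}")
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
# the effective learning rate collapses from ~0.10 to ~0.0045 on the spike step.
```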

markroxor commented 6 years ago

First, the step sizes, or effective learning rates (all terms except g_t from Eqn. 14), shown in the left portion of the figure, are larger for the lower layers of the network and much smaller for the top layer at the beginning of training. This property of ADADELTA helps to balance the fact that lower layers have smaller gradients due to the diminishing gradient problem.

markroxor commented 6 years ago

Secondly, near the end of training these step sizes converge to 1. This is typically a high learning rate that would lead to divergence in most methods; however, this convergence towards 1 only occurs near the end of training, when the gradients and parameter updates are small. In this scenario, the ε constants in the numerator and denominator dominate the past gradients and parameter updates, converging to the learning rate of 1. This leads to the last interesting property of ADADELTA, which is that when the step sizes become 1, the parameter updates (shown on the right of Fig. 2) tend towards zero. This occurs smoothly for each of the weight matrices, effectively

markroxor commented 6 years ago

An annealing schedule could possibly be added to the ADADELTA method to