Adaptive methods such as Adam, Adagrad, and RMSProp perform well in the initial portion of training, but have been found to generalize poorly compared to SGD by the end of training
Proposes SWATS, a simple strategy that SWitches from Adam To SGD when a triggering condition is satisfied
Experiments on image classification and language modeling show that SWATS can close the generalization gap between SGD and Adam
Details
Why Adaptive Methods
SGD scales the gradient uniformly in all directions, which can be harmful for ill-scaled problems
To correct this shortcoming, adaptive methods diagonally scale the gradient via estimates of the function's curvature (see the update rules below)
Although adaptive methods have been used in many applications, some authors show that even for simple quadratic problems, adaptive methods find solutions that generalize orders of magnitude worse than those found by SGD
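For reference, the standard update rules make the contrast explicit (g_k is the stochastic gradient, α_k the step size, β₁ and β₂ the moment decay rates):

```latex
% SGD: one global step size for every coordinate
w_{k+1} = w_k - \alpha_k g_k

% Adam: per-coordinate scaling by second-moment estimates
m_k = \beta_1 m_{k-1} + (1-\beta_1) g_k, \qquad
v_k = \beta_2 v_{k-1} + (1-\beta_2) g_k^2
w_{k+1} = w_k - \alpha_k \frac{\sqrt{1-\beta_2^k}}{1-\beta_1^k}
          \cdot \frac{m_k}{\sqrt{v_k} + \epsilon}
```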
SWATS
Shows the generalization gap between Adam and SGD using CIFAR-10 data
SGD vs Adam vs Adam-clip(0, 1) vs Adam-clip(1, inf), where Adam-clip(p, q) clamps Adam's per-coordinate step sizes to the interval [p, q]
Adam-clip(0, 1) performs similarly to Adam, but Adam-clip(1, inf) closes the generalization gap. This is evidence that the step sizes learned by Adam can be too small for effective convergence, i.e., Adam's step sizes need to be lower-bounded (see the sketch below)
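A minimal NumPy sketch of one plausible reading of Adam-clip(p, q), assuming the clamp applies to the per-coordinate effective learning rate and the bounds are expressed relative to a reference SGD-tuned learning rate lr_sgd (this scaling is my assumption; the paper's exact formulation may differ):

```python
import numpy as np

def adam_clip_step(w, g, m, v, k, p, q, lr=1e-3, lr_sgd=0.1,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-clip(p, q) update: plain Adam, except the per-coordinate
    effective step size is clamped to [p * lr_sgd, q * lr_sgd].
    (Sketch only; the reference scale lr_sgd is an assumption.)"""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** k)           # bias corrections
    v_hat = v / (1 - beta2 ** k)
    step = lr / (np.sqrt(v_hat) + eps)     # per-coordinate effective lr
    step = np.clip(step, p * lr_sgd, q * lr_sgd)  # the clip
    return w - step * m_hat, m, v
```

With (p, q) = (0, 1) only a ceiling is enforced, which behaves like Adam; with (1, inf) a floor is enforced on every coordinate's step size, which is exactly the lower bound the experiment shows to matter.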
When to Switch
Switching from Adam to SGD early in training leads to better generalization
Switch condition
The condition compares a bias-corrected exponential moving average of the estimated SGD learning rate with its current estimate; the switch triggers once the two agree, i.e., once the estimate has stabilized
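Concretely (sketching the paper's notation): γ_k is the SGD learning rate estimated from the current Adam step (defined in the SWATS Algorithm section below), λ_k is its exponential moving average, and ε is a small fixed tolerance:

```latex
\lambda_k = \beta_2 \lambda_{k-1} + (1 - \beta_2)\, \gamma_k,
\qquad
\text{switch to SGD when } k > 1 \text{ and }
\left| \frac{\lambda_k}{1 - \beta_2^k} - \gamma_k \right| < \epsilon
```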
What Value to Switch To
SWATS Algorithm
The condition the authors propose relates to the projection of the Adam step onto the gradient subspace
By design, it does not increase the number of hyperparameters in the optimizer: both when to switch and what SGD learning rate to switch to are computed automatically inside the SWATS algorithm (see the sketch below)
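A minimal NumPy sketch of this logic, assuming a flat parameter vector and a user-supplied grad_fn (both illustrative, not from the paper): γ_k is the scalar learning rate whose SGD step −γ_k g_k has the same projection onto the Adam step p_k as p_k itself, giving γ_k = pᵀp / (−gᵀp); once the bias-corrected average of γ_k stabilizes, training switches to SGD with momentum at that rate.

```python
import numpy as np

def swats(w, grad_fn, lr=1e-3, beta1=0.9, beta2=0.999,
          eps=1e-8, tol=1e-9, n_steps=10_000):
    """Sketch of SWATS: run Adam, estimate an equivalent SGD learning
    rate from each step, switch to SGD-with-momentum once it stabilizes.
    Illustrative only; hyperparameter defaults are common choices."""
    m, v, sgd_buf = np.zeros_like(w), np.zeros_like(w), np.zeros_like(w)
    lam, sgd_lr, phase = 0.0, None, "adam"
    for k in range(1, n_steps + 1):
        g = grad_fn(w)
        if phase == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            # bias-corrected Adam step p_k (a descent direction)
            p = -lr * (np.sqrt(1 - beta2 ** k) / (1 - beta1 ** k)) \
                * m / (np.sqrt(v) + eps)
            w = w + p
            pg = float(np.sum(p * g))
            if pg != 0.0:
                gamma = float(np.sum(p * p)) / (-pg)  # projected SGD lr
                lam = beta2 * lam + (1 - beta2) * gamma
                # trigger: bias-corrected average ~= current estimate
                if k > 1 and abs(lam / (1 - beta2 ** k) - gamma) < tol:
                    sgd_lr, phase = lam / (1 - beta2 ** k), "sgd"
        else:
            sgd_buf = beta1 * sgd_buf + g          # momentum buffer
            w = w - (1 - beta1) * sgd_lr * sgd_buf
    return w
```

Note that the trigger tolerance tol is a fixed small constant rather than a tuned knob, which is how the method avoids adding hyperparameters.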
Results
SWATS performs well on the image classification task across various architectures
Adam achieves better generalization error on the language modeling task
Summary
Note the SGD learning rate at the switchover point: it is considerably larger than the best learning rate for SGD-only training
Discussions
Switching from Adam to SGD may incur a short-term deterioration in performance, which usually recovers
Personal Thoughts
Optimizing the learning-rate policy is a difficult problem
Adam is fast but yields worse final performance
SGD is slow but yields better final performance
If we were to switch between them, when to switch and what value to switch to are the key questions
SWATS at least tries to train faster while closing the performance gap!
Good introduction
shows the weaknesses of SGD, explains each adaptive method, and covers a wide range of related work
The clip visualization is a clever way to demonstrate the weakness of adaptive methods
The explanation of the switchover point and switchover value was difficult to understand
Link: https://arxiv.org/pdf/1712.07628.pdf
Authors: Keskar et al., 2017