Adaptive methods such as Adam, Adagrad, and RMSProp perform well in the initial portion of training, but have been found to generalize poorly compared to SGD by the end of training
Proposes SWATS, a simple strategy that SWitches from Adam To SGD when a triggering condition is satisfied
Experiments on image classification and language modeling show that SWATS can close the generalization gap between SGD and Adam
Details
Why Adaptive Methods
SGD scales the gradient uniformly in all directions, which can be harmful for ill-scaled problems
To correct this shortcoming, adaptive methods diagonally scale the gradient via estimates of the function's curvature (see the update rules below)
Although adaptive methods have been used in many applications, some authors show that even for simple quadratic problems, adaptive methods find solutions that generalize orders of magnitude worse than those found by SGD
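For reference, the standard update rules make the contrast explicit (g_k is the stochastic gradient, α_k the step size, β₁ and β₂ the moment decay rates):

```latex
% SGD: one global step size for every coordinate
w_{k+1} = w_k - \alpha_k g_k

% Adam: per-coordinate scaling by second-moment estimates
m_k = \beta_1 m_{k-1} + (1-\beta_1) g_k, \qquad
v_k = \beta_2 v_{k-1} + (1-\beta_2) g_k^2
w_{k+1} = w_k - \alpha_k \frac{\sqrt{1-\beta_2^k}}{1-\beta_1^k}
          \cdot \frac{m_k}{\sqrt{v_k} + \epsilon}
```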
SWATS
Shows the generalization gap between Adam and SGD using CIFAR-10 data
SGD vs Adam vs Adam-clip(0, 1) vs Adam-clip(1, inf), where Adam-clip(p, q) clamps Adam's per-coordinate step sizes to the interval [p, q]
Adam-clip(0, 1) performs similarly to Adam, but Adam-clip(1, inf) closes the generalization gap. This is evidence that the step sizes learned by Adam can be too small for effective convergence, i.e., Adam's step sizes need to be lower-bounded (see the sketch below)
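A minimal NumPy sketch of one plausible reading of Adam-clip(p, q), assuming the clamp applies to the per-coordinate effective learning rate and the bounds are expressed relative to a reference SGD-tuned learning rate lr_sgd (this scaling is my assumption; the paper's exact formulation may differ):

```python
import numpy as np

def adam_clip_step(w, g, m, v, k, p, q, lr=1e-3, lr_sgd=0.1,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-clip(p, q) update: plain Adam, except the per-coordinate
    effective step size is clamped to [p * lr_sgd, q * lr_sgd].
    (Sketch only; the reference scale lr_sgd is an assumption.)"""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** k)           # bias corrections
    v_hat = v / (1 - beta2 ** k)
    step = lr / (np.sqrt(v_hat) + eps)     # per-coordinate effective lr
    step = np.clip(step, p * lr_sgd, q * lr_sgd)  # the clip
    return w - step * m_hat, m, v
```

With (p, q) = (0, 1) only a ceiling is enforced, which behaves like Adam; with (1, inf) a floor is enforced on every coordinate's step size, which is exactly the lower bound the experiment shows to matter.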
When to Switch
Switching from Adam to SGD early in training leads to better generalization
Switch condition
The condition compares a bias-corrected exponential moving average of the estimated SGD learning rate with its current estimate; the switch triggers once the two agree, i.e., once the estimate has stabilized
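Concretely (sketching the paper's notation): γ_k is the SGD learning rate estimated from the current Adam step (defined in the SWATS Algorithm section below), λ_k is its exponential moving average, and ε is a small fixed tolerance:

```latex
\lambda_k = \beta_2 \lambda_{k-1} + (1 - \beta_2)\, \gamma_k,
\qquad
\text{switch to SGD when } k > 1 \text{ and }
\left| \frac{\lambda_k}{1 - \beta_2^k} - \gamma_k \right| < \epsilon
```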
What Value to Switch To
SWATS Algorithm
The condition the authors propose relates to the projection of the Adam step onto the gradient subspace
By design, it does not increase the number of hyperparameters in the optimizer: both when to switch and what SGD learning rate to switch to are computed automatically inside the SWATS algorithm (see the sketch below)
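A minimal NumPy sketch of this logic, assuming a flat parameter vector and a user-supplied grad_fn (both illustrative, not from the paper): γ_k is the scalar learning rate whose SGD step −γ_k g_k has the same projection onto the Adam step p_k as p_k itself, giving γ_k = pᵀp / (−gᵀp); once the bias-corrected average of γ_k stabilizes, training switches to SGD with momentum at that rate.

```python
import numpy as np

def swats(w, grad_fn, lr=1e-3, beta1=0.9, beta2=0.999,
          eps=1e-8, tol=1e-9, n_steps=10_000):
    """Sketch of SWATS: run Adam, estimate an equivalent SGD learning
    rate from each step, switch to SGD-with-momentum once it stabilizes.
    Illustrative only; hyperparameter defaults are common choices."""
    m, v, sgd_buf = np.zeros_like(w), np.zeros_like(w), np.zeros_like(w)
    lam, sgd_lr, phase = 0.0, None, "adam"
    for k in range(1, n_steps + 1):
        g = grad_fn(w)
        if phase == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            # bias-corrected Adam step p_k (a descent direction)
            p = -lr * (np.sqrt(1 - beta2 ** k) / (1 - beta1 ** k)) \
                * m / (np.sqrt(v) + eps)
            w = w + p
            pg = float(np.sum(p * g))
            if pg != 0.0:
                gamma = float(np.sum(p * p)) / (-pg)  # projected SGD lr
                lam = beta2 * lam + (1 - beta2) * gamma
                # trigger: bias-corrected average ~= current estimate
                if k > 1 and abs(lam / (1 - beta2 ** k) - gamma) < tol:
                    sgd_lr, phase = lam / (1 - beta2 ** k), "sgd"
        else:
            sgd_buf = beta1 * sgd_buf + g          # momentum buffer
            w = w - (1 - beta1) * sgd_lr * sgd_buf
    return w
```

Note that the trigger tolerance tol is a fixed small constant rather than a tuned knob, which is how the method avoids adding hyperparameters.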
Results
SWATS performs well on the image classification task across various architectures
Adam achieves better generalization error on the language modeling task
Summary
Note the SGD learning rate at the switchover point: it is considerably larger than the best learning rate for SGD-only training
Discussions
Switching from Adam to SGD may incur a short-term deterioration in performance, which usually recovers
Personal Thoughts
Optimizing the learning-rate policy is a difficult problem
Adam is fast but yields worse final performance
SGD is slow but yields better final performance
If we were to switch between them, when to switch and what value to switch to are the key questions
SWATS at least tries to train faster while closing the performance gap!
Good introduction
shows the weaknesses of SGD, explains each adaptive method, and covers a wide range of related work
The clip visualization is a clever way to demonstrate the weakness of adaptive methods
The explanation of the switchover point and switchover value was difficult to understand
Link: https://arxiv.org/pdf/1712.07628.pdf
Authors: Keskar et al., 2017