Abstract
proposes cyclical learning rates (CLR), which let the learning rate cyclically vary between reasonable boundary values
this eliminates the need to experimentally find the best values and schedule for the global learning rate
empirical results are demonstrated on CIFAR-10, CIFAR-100, and ImageNet models
Details
Introduction
Conventional wisdom says
too small learning rate -> slow convergence
too high learning rate -> diverge
learning rate should be a single value that monotonically decreases during training
This paper proposes that varying the learning rate during training is beneficial overall
as shown in Figure 1, CLR outperforms other learning rate policies
Related Works
Adaptive learning rates
Adam, RMSProp, AdaDelta, and AdaSecant adaptively choose the learning rate according to running averages of the magnitudes of recent gradients, at a significant computational cost
Cyclical Learning Rates
Optimal Learning Rates
optimal min/max boundaries for the cyclical learning rate can be determined via an LR range test (running the model for several epochs while letting the learning rate increase linearly between low and high LR values); a sketch is given below
according to Figure 3, min_lr=0.001 and max_lr=0.006, because beyond lr=0.006 the accuracy fluctuates
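A minimal sketch of how the LR range test could be run, assuming a PyTorch setup; `model`, `loss_fn`, and `train_loader` are placeholders for whatever you already have, and the boundaries are read off the recorded (lr, loss) curve by eye rather than computed.

```python
import torch

def lr_range_test(model, loss_fn, train_loader, low_lr=1e-5, high_lr=1e-1,
                  num_iters=1000, device="cpu"):
    """Linearly increase the LR from low_lr to high_lr over num_iters
    mini-batches and record (lr, loss) pairs; useful CLR boundaries are
    roughly where the loss starts to drop and where it becomes ragged."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=low_lr, momentum=0.9)
    history, it = [], 0
    while it < num_iters:
        for x, y in train_loader:
            if it >= num_iters:
                break
            # Linear schedule between the two trial boundaries.
            lr = low_lr + (high_lr - low_lr) * it / num_iters
            for group in optimizer.param_groups:
                group["lr"] = lr
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
            history.append((lr, loss.item()))
            it += 1
    return history
```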
Cyclical function
linear, parabolic (Welch window), and sinusoidal (Hann window) cycle shapes all produce similar results, so linear is chosen for simplicity
Variations
triangular : the learning rate cycles linearly between min_lr and max_lr
triangular2 : the max_lr - min_lr amplitude is cut in half after each cycle of triangular
exp_range : each boundary value declines by an exponential factor of gamma ^ iteration
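A minimal sketch of the three policies as a pure function of the training iteration, following the triangular formula from the paper; step_size (half a cycle, in iterations) and gamma are values you would choose yourself.

```python
import math

def clr(iteration, min_lr, max_lr, step_size, mode="triangular", gamma=0.99994):
    """Cyclical learning rate at a given iteration; step_size is the number
    of iterations in half a cycle."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    base = max(0.0, 1.0 - x)              # triangular wave in [0, 1]
    if mode == "triangular":              # bounds stay fixed every cycle
        scale = 1.0
    elif mode == "triangular2":           # amplitude halves after each cycle
        scale = 1.0 / (2 ** (cycle - 1))
    elif mode == "exp_range":             # amplitude decays by gamma^iteration
        scale = gamma ** iteration
    else:
        raise ValueError(f"unknown mode: {mode}")
    return min_lr + (max_lr - min_lr) * base * scale

# Example with the CIFAR-10 boundaries quoted above (min_lr=0.001, max_lr=0.006).
for it in (0, 1000, 2000, 3000, 4000):
    print(it, round(clr(it, 0.001, 0.006, step_size=2000), 5))
```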
Experiments
Comparison of CLR with adaptive learning rate methods on CIFAR-10
mixed results in final accuracy, but CLR reaches a given accuracy in clearly fewer iterations
ResNet, Stochastic Depth, DenseNet on CIFAR-10 & CIFAR-100
averaged over 5 runs, CLR gives clearly better performance
ImageNet / AlexNet architecture
fixed is almost equal to triangular2, which suggests the optimal learning rate had already been found (but note that numerous grid searches were conducted with fixed to reach that optimal learning rate, whereas CLR reaches similar performance after only an LR range test)
ImageNet / GoogleNet architecture
LR range test : min_lr=0.01, max_lr=0.026
triangular2 outperforms fixed by 1.5%
Personal Thoughts
Good starting point for a new task, new architecture, or new dataset
Link : https://arxiv.org/pdf/1506.01186.pdf
Author : Smith, 2017