Abstract
proposes cyclical learning rates (CLR), which let the learning rate cyclically vary between reasonable boundary values
this eliminates the need to experimentally find the best values and schedule for the global learning rate
empirical results are demonstrated on CIFAR-10, CIFAR-100, and ImageNet models
Details
Introduction
Conventional wisdom says
too small learning rate -> slow convergence
too high learning rate -> diverge
learning rate should be a single value that monotonically decreases during training
This paper proposes that varying the learning rate during training is beneficial overall
as shown in Figure 1, CLR outperforms other learning rate policies
Related Works
Adaptive learning rates
Adam, RMSProp, AdaDelta, and AdaSecant adaptively choose the learning rate according to running averages of the magnitudes of recent gradients, at a significant computational cost
Cyclical Learning Rates
Optimal Learning Rates
optimal min/max boundaries for the cyclical learning rate can be determined via an LR range test (running the model for several epochs while letting the learning rate increase linearly between low and high LR values); a sketch is given below
according to Figure 3, min_lr=0.001 and max_lr=0.006, because beyond lr=0.006 the accuracy fluctuates
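A minimal sketch of how the LR range test could be run, assuming a PyTorch setup; `model`, `loss_fn`, and `train_loader` are placeholders for whatever you already have, and the boundaries are read off the recorded (lr, loss) curve by eye rather than computed.

```python
import torch

def lr_range_test(model, loss_fn, train_loader, low_lr=1e-5, high_lr=1e-1,
                  num_iters=1000, device="cpu"):
    """Linearly increase the LR from low_lr to high_lr over num_iters
    mini-batches and record (lr, loss) pairs; useful CLR boundaries are
    roughly where the loss starts to drop and where it becomes ragged."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=low_lr, momentum=0.9)
    history, it = [], 0
    while it < num_iters:
        for x, y in train_loader:
            if it >= num_iters:
                break
            # Linear schedule between the two trial boundaries.
            lr = low_lr + (high_lr - low_lr) * it / num_iters
            for group in optimizer.param_groups:
                group["lr"] = lr
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
            history.append((lr, loss.item()))
            it += 1
    return history
```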
Cyclical function
linear, parabolic (Welch window), and sinusoidal (Hann window) cycle shapes all produce similar results, so linear is chosen for simplicity
Variations
triangular : the learning rate cycles linearly between min_lr and max_lr
triangular2 : the max_lr - min_lr amplitude is cut in half after each cycle of triangular
exp_range : each boundary value declines by an exponential factor of gamma ^ iteration
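A minimal sketch of the three policies as a pure function of the training iteration, following the triangular formula from the paper; step_size (half a cycle, in iterations) and gamma are values you would choose yourself.

```python
import math

def clr(iteration, min_lr, max_lr, step_size, mode="triangular", gamma=0.99994):
    """Cyclical learning rate at a given iteration; step_size is the number
    of iterations in half a cycle."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    base = max(0.0, 1.0 - x)              # triangular wave in [0, 1]
    if mode == "triangular":              # bounds stay fixed every cycle
        scale = 1.0
    elif mode == "triangular2":           # amplitude halves after each cycle
        scale = 1.0 / (2 ** (cycle - 1))
    elif mode == "exp_range":             # amplitude decays by gamma^iteration
        scale = gamma ** iteration
    else:
        raise ValueError(f"unknown mode: {mode}")
    return min_lr + (max_lr - min_lr) * base * scale

# Example with the CIFAR-10 boundaries quoted above (min_lr=0.001, max_lr=0.006).
for it in (0, 1000, 2000, 3000, 4000):
    print(it, round(clr(it, 0.001, 0.006, step_size=2000), 5))
```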
Experiments
Comparison of CLR with adaptive learning rate methods on CIFAR-10
mixed results in final accuracy, but CLR reaches a given accuracy in clearly fewer iterations
ResNet, Stochastic Depth, DenseNet on CIFAR-10 & CIFAR-100
averaged over 5 runs, CLR gives clearly better performance
ImageNet / AlexNet architecture
fixed is almost equal to triangular2, which suggests the optimal learning rate had already been found (but note that numerous grid searches were conducted with fixed to reach that optimal learning rate, whereas CLR reaches similar performance after only an LR range test)
ImageNet / GoogleNet architecture
LR range test : min_lr=0.01, max_lr=0.026
triangular2 outperforms fixed by 1.5%
Personal Thoughts
Good starting point for a new task, new architecture, or new dataset
Link : https://arxiv.org/pdf/1506.01186.pdf
Author : Smith, 2017