Experiments with numerous functional forms, such as a triangular window (linear), a Welch window (parabolic), and a Hann window (sinusoidal), all produced equivalent results.
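As a sketch, the triangular (linear) policy from the paper can be computed per iteration as follows; the parameter names `base_lr`, `max_lr`, and `step_size` follow the paper's terminology:

```python
import math

def triangular_clr(iteration, step_size, base_lr, max_lr):
    """Triangular cyclical learning rate.

    The LR ramps linearly from base_lr up to max_lr over `step_size`
    iterations, then back down, repeating every 2 * step_size iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

For example, with `base_lr=0.001`, `max_lr=0.006`, and `step_size=2000`, the rate starts at 0.001, peaks at 0.006 at iteration 2000, and returns to 0.001 at iteration 4000.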
A more practical reason why CLR works is that the optimum learning rate likely lies between the chosen bounds, so near-optimal learning rates are used throughout training.
In addition, saddle points have small gradients that can slow learning; periodically increasing the learning rate allows more rapid traversal of these regions.
3.2. How can one estimate a good value for the cycle length?
The paper proposes an “LR range test”: run the model for several epochs while letting the learning rate increase linearly between low and high values.
Next, plot the accuracy versus learning rate. Note the learning rate at which the accuracy starts to increase, and the point at which it slows, becomes ragged, or starts to fall. These two learning rates are good choices for the bounds.
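A minimal sketch of the range test follows. Only the schedule itself is implemented; the training loop is shown in comments, and `train_one_batch` / `eval_accuracy` are hypothetical stand-ins for your framework's training code, not functions from the paper:

```python
def lr_range_test_schedule(low_lr, high_lr, num_steps):
    """Linearly spaced learning rates for an LR range test."""
    step = (high_lr - low_lr) / (num_steps - 1)
    return [low_lr + step * i for i in range(num_steps)]

# Sketch of the test loop (train_one_batch and eval_accuracy are
# hypothetical placeholders):
#
# history = []
# for lr in lr_range_test_schedule(1e-5, 1.0, num_steps=1000):
#     train_one_batch(model, lr)          # one mini-batch at this LR
#     history.append((lr, eval_accuracy(model)))
#
# Plot accuracy versus LR from `history`; pick base_lr where accuracy
# starts rising and max_lr where it slows, becomes ragged, or falls.
```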
Alternatively, one can use the rule of thumb that the optimum learning rate is usually within a factor of two of the largest one that converges, and set base_lr to 1/3 or 1/4 of max_lr.
Paper: “Cyclical Learning Rates for Training Neural Networks” by Leslie N. Smith