srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Learning rate scheduling #195

Closed efosler closed 5 years ago

efosler commented 5 years ago

I noticed, particularly when training big nets, that the default learning rate mechanism is pretty inefficient. By and large, after the initial burn-in and first rate drop, the system spends at most one fruitful epoch at each learning rate before having to backtrack. The "newbob" rate schedule is probably more efficient than the default (which is typically flat until no improvement, then halves until no improvement, then stops). One could keep the same burn-in period as well.

I'm planning to implement a separate learning rate scheduler module that can be switched in, which will allow for different implementations. I'll keep the current as a default, but we may want to consider switching to newbob at some point.

Plus I love the idea of "newbob" continuing to live on...
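
For concreteness, a minimal sketch of the two policies (the names, thresholds, and halving factor are illustrative only, not the actual EESEN code):

def halving_schedule(lr, improvement, threshold=0.005):
    # Current default (illustrative): hold the rate, and halve it (with a
    # backtrack to the best previous model) only in epochs where the
    # validation improvement falls below the threshold.
    return lr * 0.5 if improvement < threshold else lr

def newbob_schedule(lr, improvement, ramping, start=0.005, stop=0.001):
    # newbob-style (illustrative): hold the rate until the improvement first
    # drops below `start`, then halve it on every subsequent epoch, and stop
    # training once the improvement drops below `stop` during the ramping phase.
    # Returns (new_lr, ramping, stop_training).
    if not ramping and improvement >= start:
        return lr, False, False
    return lr * 0.5, True, ramping and improvement < stop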

fmetze commented 5 years ago

That would be great. Newbob came out of ICSI, right? Did you do it?

efosler commented 5 years ago

ICSI, yes, but it wasn't me - it was a programmer that Morgan had hired back when we had an NN trainer called BoB (Boxes of Boxes). He discovered the same kind of trend and just made "newbob" the default, since it was the second thing he tried. The name carried over when Dave coded up quicknet.

I just finished the --roll deprecation; I'll code this up today and test it out, and then "roll" those two changes together once the other pull request gets sorted out.

ramonsanabria commented 5 years ago

Hi Eric,

Thanks for bringing this up. Yes, I love the idea of having a separate module (maybe a class) that can control the learning rate schedule much more efficiently and in a more modular way. Please let me know if I can help with that.

Here is where our scheduler logic is concentrated:

https://github.com/srvk/eesen/blob/tf_clean/tf/ctc-am/tf/tf_train.py#L132

Is this the newbob idea?

--lrate="D:l:c:dh,ds:n"

Starts with the learning rate l; if the validation error reduction between two consecutive epochs is less than dh, the learning rate is scaled by c during each of the remaining epochs. Training finally terminates when the validation error reduction between two consecutive epochs falls below ds. n is the minimum epoch number after which scaling can be performed.

If so, I think we already have part of the idea implemented. We currently have the initial burn-in period and we load the best previous weights. So we would only need to add the check on the error reduction between two consecutive epochs, and we could expose those thresholds as additional hyperparameters.
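
As a quick reference for that format, a rough parsing sketch (the example value and the dictionary keys are illustrative, not what tf_train.py actually does):

def parse_lrate(spec):
    # Parse a spec like "D:0.04:0.5:0.05,0.05:8" into its parts: the leading
    # field D, initial rate l, scale c, thresholds dh and ds, and min epoch n.
    d, l, c, thresholds, n = spec.split(":")
    dh, ds = (float(x) for x in thresholds.split(","))
    return {"D": d, "lrate": float(l), "scale": float(c),
            "start_threshold": dh, "stop_threshold": ds, "min_epoch": int(n)}

print(parse_lrate("D:0.04:0.5:0.05,0.05:8"))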

efosler commented 5 years ago

Yep, that seems about right. What is D?

How about this:

I'm about halfway through factoring out the code into a separate scheduler module (I called the old lrscheduler "Halvesies").
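
Something along these lines for the factored-out interface, perhaps (the class and method names here are hypothetical, just to illustrate the shape of the module):

class LRScheduler(object):
    # Decides, after each epoch, the next learning rate, whether to restore the
    # previous best model, and whether to stop training.
    def __init__(self, initial_lr):
        self.lr = initial_lr

    def update(self, epoch, val_ter):
        # Return (next_lr, restore_previous_model, stop_training).
        raise NotImplementedError

class ConstantLR(LRScheduler):
    # Trivial scheduler: never changes the rate, never stops early.
    def update(self, epoch, val_ter):
        return self.lr, False, False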

ramonsanabria commented 5 years ago

Awesome, works for me. Thanks Eric!

ramonsanabria commented 5 years ago

I just took the explanation from Yajie's docs:

https://www.cs.cmu.edu/~ymiao/pdnntk/lrate.html

D was the type of scheduler. But I think with --lrate_algorithm we are good to go.

efosler commented 5 years ago

OK, I have three schedulers working (Halvsies - the current schedule, Newbob, and Constantlr). Everything checks out for new runs. The code does not quite do the right thing for restarts, so I will fix that tomorrow and then make a pull request.

efosler commented 5 years ago

Created pull request #198 which implements the new lrscheduler objects.

ramonsanabria commented 5 years ago

Awesome. Thank you very much again, Eric. Will merge it now, thanks.

There is one thing that keeps bugging me. When restarting the training, should we not consider the previous TER, so we know at which point we are in the ramping phase?

efosler commented 5 years ago

Heh. What I did was tag the particular lrscheduler messages in the log file so you know what phase you're in - at restart, the log files are read for those tags at the same time the TER is recalculated from the relevant log file. For example (pulled from the output rather than the log file, using a rather odd set of parameters, since I was making sure that everything worked right):

[2018-08-24 15:39:32] Epoch 1 starting, learning rate: 0.06
[2018-08-24 15:39:57] Epoch 1 finished in 0 minutes
                Train    cost: 303.2, ter: 87.7%, #example: 297
                Validate cost: 182.0, ter: 88.9%, #example: 30
[2018-08-24 15:39:57] LRScheduler.Newbob: not updating learning rate for first 3 epochs
--------------------------------------------------------------------------------
[2018-08-24 15:39:57] Epoch 2 starting, learning rate: 0.06
[2018-08-24 15:40:21] Epoch 2 finished in 0 minutes
                Train    cost: 206.1, ter: 89.1%, #example: 297
                Validate cost: 117.2, ter: 92.8%, #example: 30
[2018-08-24 15:40:21] LRScheduler.Newbob: not updating learning rate for first 3 epochs
--------------------------------------------------------------------------------
[2018-08-24 15:40:21] Epoch 3 starting, learning rate: 0.06
[2018-08-24 15:40:45] Epoch 3 finished in 0 minutes
                Train    cost: 166.3, ter: 94.2%, #example: 297
                Validate cost: 140.0, ter: 97.3%, #example: 30
[2018-08-24 15:40:45] LRScheduler.Newbob: not updating learning rate for first 3 epochs
--------------------------------------------------------------------------------
[2018-08-24 15:40:45] Epoch 4 starting, learning rate: 0.06
[2018-08-24 15:41:10] Epoch 4 finished in 0 minutes
                Train    cost: 154.5, ter: 96.6%, #example: 297
                Validate cost: 108.3, ter: 96.4%, #example: 30
[2018-08-24 15:41:10] LRScheduler.Newbob: learning rate remaining constant 0.06, TER improved 0.9% from epoch 3
--------------------------------------------------------------------------------
[2018-08-24 15:41:10] Epoch 5 starting, learning rate: 0.06
[2018-08-24 15:41:34] Epoch 5 finished in 0 minutes
                Train    cost: 143.0, ter: 94.7%, #example: 297
                Validate cost: 97.5, ter: 98.3%, #example: 30
[2018-08-24 15:41:34] LRScheduler.Newbob: beginning ramping to learn rate 0.045, TER difference -1.9% under threshold 0.0% from epoch 4
restoring model from epoch 4
--------------------------------------------------------------------------------
[2018-08-24 15:41:35] Epoch 6 starting, learning rate: 0.045
[2018-08-24 15:41:59] Epoch 6 finished in 0 minutes
                Train    cost: 143.2, ter: 94.8%, #example: 297
                Validate cost: 107.6, ter: 96.2%, #example: 30
[2018-08-24 15:41:59] LRScheduler.Newbob: learning rate ramping to 0.03375, TER improved 0.3% from epoch 4
--------------------------------------------------------------------------------
[2018-08-24 15:41:59] Epoch 7 starting, learning rate: 0.03375
[2018-08-24 15:42:23] Epoch 7 finished in 0 minutes
                Train    cost: 140.8, ter: 95.6%, #example: 297
                Validate cost: 104.7, ter: 97.2%, #example: 30
[2018-08-24 15:42:23] LRScheduler.Newbob: stopping training, TER difference -1.0% under threshold 0.0% from epoch 6
restoring model from epoch 6

efosler commented 5 years ago

(Basically, the idea occurred to me when I saw the code looking for VALIDATE tags at restart to get the TER.)
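
A rough sketch of what that recovery could look like, based on the tagged lines in the example output above (the function name and regular expression are illustrative, not the code in the pull request):

import re

def recover_newbob_state(log_path):
    # Scan 'LRScheduler.Newbob:' lines to infer whether ramping has started and
    # what the most recently reported learning rate was.
    ramping, last_lr = False, None
    with open(log_path) as log:
        for line in log:
            if "LRScheduler.Newbob:" not in line:
                continue
            if "ramping" in line:
                ramping = True
            m = re.search(r"(?:learn(?:ing)? rate|ramping to|constant)\s+([0-9.]+)", line)
            if m:
                last_lr = float(m.group(1))
    return ramping, last_lr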

efosler commented 5 years ago

Updated with a patch to the newbob scheduler (#202); closing the issue.