stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Training does not resume where left off #170

Closed · aphedges closed this issue 4 years ago

aphedges commented 4 years ago

I have been training my own GloVe models for my job, and I noticed that training does not resume properly when loading the output of one run as the initial parameters of another. The cost at the beginning of the second run is higher than at the end of the first run, and after a couple of iterations, it spikes by over an order of magnitude before eventually going back down again.

I determined that the issue is that although we have the option of saving both the parameters and the squared gradients, we can only load the parameters. Modifying the code to also allow loading the squared gradients results in the second run picking up right where the first one ended.
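
For reference on why this matters: glove.c trains with AdaGrad, so every update is divided by the square root of the accumulated squared gradients. A simplified sketch of the per-element step (variable names are illustrative, not the exact source):

```c
#include <math.h>

/* Simplified sketch of the per-element AdaGrad step in glove.c (names are
   illustrative, not the exact source). fdiff is the weighted,
   learning-rate-scaled error for the current co-occurrence pair. */
void adagrad_step(double *w1, double *w2, double *gradsq1, double *gradsq2,
                  int vector_size, double fdiff) {
    for (int b = 0; b < vector_size; b++) {
        double temp1 = fdiff * w2[b];       /* gradient w.r.t. w1[b] */
        double temp2 = fdiff * w1[b];       /* gradient w.r.t. w2[b] */
        w1[b] -= temp1 / sqrt(gradsq1[b]);  /* step shrinks as gradsq grows */
        w2[b] -= temp2 / sqrt(gradsq2[b]);
        gradsq1[b] += temp1 * temp1;        /* accumulate squared gradients */
        gradsq2[b] += temp2 * tem2;
    }
}
```

The accumulators are initialized to 1.0 at the start of a run, so if only the parameters are restored, the denominators collapse back to roughly 1 and the first updates of the "resumed" run are effectively full-learning-rate steps on already-converged vectors.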

I will be submitting a PR shortly with this fix to allow proper checkpointing and resuming. Without these changes, there isn't much point in resuming if it can fail this badly.

aphedges commented 4 years ago

I thought it might be a good idea to clarify what I was trying to do and what was happening.

I was training 300-dimensional GloVe embeddings on Gigaword, following the recommended parameters from the paper, which, among other things, recommends 100 epochs. My system restricts programs to a maximum runtime of one day, so I was only able to run 60 epochs. The end of the first run looked like this:

03/12/20 - 10:07.46AM, iter: 056, cost: 0.013656
03/12/20 - 10:28.53AM, iter: 057, cost: 0.013640
03/12/20 - 10:49.56AM, iter: 058, cost: 0.013624
03/12/20 - 11:10.58AM, iter: 059, cost: 0.013607
03/12/20 - 11:32.03AM, iter: 060, cost: 0.013592
    saving itermediate parameters for iter 060...done.

I used the saved parameters from epoch 60 to start another run of 40 epochs. The beginning looked like this:

03/12/20 - 12:56.22PM, iter: 001, cost: 0.016220
03/12/20 - 01:18.20PM, iter: 002, cost: 0.015725
03/12/20 - 01:40.12PM, iter: 003, cost: 0.015484
03/12/20 - 02:02.02PM, iter: 004, cost: 0.015300
03/12/20 - 02:23.39PM, iter: 005, cost: 4.038435
03/12/20 - 02:45.15PM, iter: 006, cost: 5.178788
03/12/20 - 03:06.47PM, iter: 007, cost: 8.037360
03/12/20 - 03:28.19PM, iter: 008, cost: 3.863456
03/12/20 - 03:49.57PM, iter: 009, cost: 2.563186

A spike in cost of several orders of magnitude shouldn't be happening, and the cost never fully recovered by the end of the run.

I therefore made a fix to also load the squared gradients. With the fix, the cost looked much more like what I expected:

03/12/20 - 02:41.57PM, iter: 001, cost: 0.013577
03/12/20 - 03:02.47PM, iter: 002, cost: 0.013563
03/12/20 - 03:23.39PM, iter: 003, cost: 0.013549
03/12/20 - 03:44.30PM, iter: 004, cost: 0.013534
03/12/20 - 04:05.20PM, iter: 005, cost: 0.013520
03/12/20 - 04:26.09PM, iter: 006, cost: 0.013507
03/12/20 - 04:46.59PM, iter: 007, cost: 0.013493
03/12/20 - 05:07.49PM, iter: 008, cost: 0.013480
03/12/20 - 05:28.42PM, iter: 009, cost: 0.013468

The PR I submitted includes the code to allow the seamless continuation of a previous run.
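
For anyone who wants to see the shape of the fix, here is a minimal sketch of restoring both arrays before training starts. It is illustrative only, not the code from the PR, and it assumes the checkpoints are flat binary arrays of doubles matching the in-memory layout; the real files depend on the -binary and -save-gradsq settings.

```c
#include <stdio.h>

/* Illustrative sketch only (not the PR code): restore both the parameters
   and the AdaGrad accumulators before resuming training. Assumes each file
   is a flat binary array of doubles matching the in-memory layout. */
static int load_array(const char *path, double *dest, long long n) {
    FILE *fin = fopen(path, "rb");
    if (fin == NULL) return -1;
    size_t nread = fread(dest, sizeof(double), (size_t)n, fin);
    fclose(fin);
    return nread == (size_t)n ? 0 : -1;
}

int resume_from_checkpoint(const char *param_file, const char *gradsq_file,
                           double *W, double *gradsq, long long n_params) {
    if (load_array(param_file, W, n_params) != 0) return -1;
    /* Without this second load, gradsq keeps its fresh-run value of 1.0 and
       the first "resumed" updates take near full-size steps. */
    if (load_array(gradsq_file, gradsq, n_params) != 0) return -1;
    return 0;
}
```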