stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

cost jumps after some epochs #199

Closed rpvelloso closed 2 years ago

rpvelloso commented 2 years ago

This happened on different machines, at different points during training. I couldn't find where this is happening in the source code. Can't share the data :(

TRAINING MODEL
Read 563738072 lines.
Initializing parameters...Using random seed 1635360331
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75

10/27/21 - 03:59.19PM, iter: 1, cost: 0.071133
10/27/21 - 04:13.09PM, iter: 2, cost: 0.0551224
10/27/21 - 04:27.28PM, iter: 3, cost: 0.0505688
10/27/21 - 04:41.43PM, iter: 4, cost: 0.0454685
10/27/21 - 04:55.38PM, iter: 5, cost: 0.0416584
10/27/21 - 05:09.45PM, iter: 6, cost: 0.0392608
10/27/21 - 05:23.49PM, iter: 7, cost: 0.0376295
10/27/21 - 05:37.49PM, iter: 8, cost: 0.0365679
10/27/21 - 05:51.59PM, iter: 9, cost: 0.0359734
10/27/21 - 06:05.49PM, iter: 10, cost: 34.4493

AngledLuffa commented 2 years ago

I know the answer to this before I even ask, but is it a situation where you can give me temporary access to the machine where the training is happening?

rpvelloso commented 2 years ago

Do you have any idea what might be happening here? I'd guess something involving cost[], but I can't pinpoint it.

rpvelloso commented 2 years ago

Lowering eta to 0.01 helped:

10/27/21 - 07:38.53PM, iter: 1, cost: 0.0874979
10/27/21 - 07:53.09PM, iter: 2, cost: 0.069715
10/27/21 - 08:07.01PM, iter: 3, cost: 0.0608068
10/27/21 - 08:20.54PM, iter: 4, cost: 0.0554227
10/27/21 - 08:34.46PM, iter: 5, cost: 0.0514724
10/27/21 - 08:48.39PM, iter: 6, cost: 0.048066
10/27/21 - 09:02.32PM, iter: 7, cost: 0.0448908
10/27/21 - 09:16.53PM, iter: 8, cost: 0.0418435
10/27/21 - 09:31.28PM, iter: 9, cost: 0.0389835
10/27/21 - 09:46.21PM, iter: 10, cost: 0.0364648
10/27/21 - 10:01.12PM, iter: 11, cost: 0.0344151
10/27/21 - 10:16.08PM, iter: 12, cost: 0.0328401
10/27/21 - 10:31.01PM, iter: 13, cost: 0.0316272
10/27/21 - 10:45.54PM, iter: 14, cost: 0.0306751
10/27/21 - 11:00.39PM, iter: 15, cost: 0.0299081

Maybe that was it? Learning rate too large (I was using the default value, 0.05).

rpvelloso commented 2 years ago

I've made some changes to deal with this issue during training:

TRAINING MODEL
Read 455065004 lines.
Initializing parameters...Using random seed 1635978159
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75
epochs: 40
eta: 0.05
11/03/21 - 07:39.33PM, iter: 1, cost: 0.0823551
11/03/21 - 07:56.42PM, iter: 2, cost: 0.064717
11/03/21 - 08:13.47PM, iter: 3, cost: 0.0599455
11/03/21 - 08:30.48PM, iter: 4, cost: 0.0536976
11/03/21 - 08:47.54PM, iter: 5, cost: 0.0487482
11/03/21 - 09:05.05PM, iter: 6, cost: 0.0456012
11/03/21 - 09:22.10PM, iter: 7, cost: 0.0435627
11/03/21 - 09:39.50PM, iter: 8, cost: 0.042235
11/03/21 - 09:56.59PM, iter: 9, cost: 43.8812
11/03/21 - 09:56.59PM cost increased, restoring last training checkpoint and lowering eta from 0.05 to 0.025
11/03/21 - 10:14.01PM, iter: 9, cost: 0.0361162
11/03/21 - 10:31.02PM, iter: 10, cost: 0.033401
11/03/21 - 10:48.04PM, iter: 11, cost: 0.0325761
11/03/21 - 11:05.09PM, iter: 12, cost: 0.0320624

I'm saving the gradients and weights each epoch; whenever the cost increases I restore the last checkpoint and decrease the learning rate by a decay factor. Seems to work.

rpvelloso commented 2 years ago

Another change I've made: switching from double to single precision resulted in a major speedup. Each epoch took 19 min; now it's about 9 min on the same machine.

rpvelloso commented 2 years ago

I also rewrote it in C++/STL to get rid of the mallocs and possible memory leaks.

rpvelloso commented 2 years ago

Pushed my changes to my own cloned repo https://github.com/rpvelloso/GloVe : 1) vocab_count and glove rewritten in C++ (will rewrite the other modules later); 2) glove runs twice as fast with single-precision math; 3) rolls back an epoch if the cost increases and decays eta.

closing.