stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

cost jumps after some epochs #199

Closed rpvelloso closed 2 years ago

rpvelloso commented 2 years ago

This happened on different machines, at different points during training. I couldn't find where this is happening in the source code. Can't share the data :(

TRAINING MODEL
Read 563738072 lines.
Initializing parameters...Using random seed 1635360331
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75

10/27/21 - 03:59.19PM, iter: 1, cost: 0.071133
10/27/21 - 04:13.09PM, iter: 2, cost: 0.0551224
10/27/21 - 04:27.28PM, iter: 3, cost: 0.0505688
10/27/21 - 04:41.43PM, iter: 4, cost: 0.0454685
10/27/21 - 04:55.38PM, iter: 5, cost: 0.0416584
10/27/21 - 05:09.45PM, iter: 6, cost: 0.0392608
10/27/21 - 05:23.49PM, iter: 7, cost: 0.0376295
10/27/21 - 05:37.49PM, iter: 8, cost: 0.0365679
10/27/21 - 05:51.59PM, iter: 9, cost: 0.0359734
10/27/21 - 06:05.49PM, iter: 10, cost: 34.4493

AngledLuffa commented 2 years ago

I know the answer to this before I even ask, but is it a situation where you can give me temporary access to the machine where the training is happening?

rpvelloso commented 2 years ago

Do you have any idea what might be happening here? I'd guess something involving cost[], but I can't pinpoint it.

rpvelloso commented 2 years ago

Lowering eta to 0.01 helped:

10/27/21 - 07:38.53PM, iter: 1, cost: 0.0874979
10/27/21 - 07:53.09PM, iter: 2, cost: 0.069715
10/27/21 - 08:07.01PM, iter: 3, cost: 0.0608068
10/27/21 - 08:20.54PM, iter: 4, cost: 0.0554227
10/27/21 - 08:34.46PM, iter: 5, cost: 0.0514724
10/27/21 - 08:48.39PM, iter: 6, cost: 0.048066
10/27/21 - 09:02.32PM, iter: 7, cost: 0.0448908
10/27/21 - 09:16.53PM, iter: 8, cost: 0.0418435
10/27/21 - 09:31.28PM, iter: 9, cost: 0.0389835
10/27/21 - 09:46.21PM, iter: 10, cost: 0.0364648
10/27/21 - 10:01.12PM, iter: 11, cost: 0.0344151
10/27/21 - 10:16.08PM, iter: 12, cost: 0.0328401
10/27/21 - 10:31.01PM, iter: 13, cost: 0.0316272
10/27/21 - 10:45.54PM, iter: 14, cost: 0.0306751
10/27/21 - 11:00.39PM, iter: 15, cost: 0.0299081

Maybe that was it? Learning rate too large (I was using the default value, 0.05).

rpvelloso commented 2 years ago

I've made some changes to deal with this issue during training:

TRAINING MODEL
Read 455065004 lines.
Initializing parameters...Using random seed 1635978159
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75
epochs: 40
eta: 0.05
11/03/21 - 07:39.33PM, iter: 1, cost: 0.0823551
11/03/21 - 07:56.42PM, iter: 2, cost: 0.064717
11/03/21 - 08:13.47PM, iter: 3, cost: 0.0599455
11/03/21 - 08:30.48PM, iter: 4, cost: 0.0536976
11/03/21 - 08:47.54PM, iter: 5, cost: 0.0487482
11/03/21 - 09:05.05PM, iter: 6, cost: 0.0456012
11/03/21 - 09:22.10PM, iter: 7, cost: 0.0435627
11/03/21 - 09:39.50PM, iter: 8, cost: 0.042235
11/03/21 - 09:56.59PM, iter: 9, cost: 43.8812
11/03/21 - 09:56.59PM cost increased, restoring last training checkpoint and lowering eta from 0.05 to 0.025
11/03/21 - 10:14.01PM, iter: 9, cost: 0.0361162
11/03/21 - 10:31.02PM, iter: 10, cost: 0.033401
11/03/21 - 10:48.04PM, iter: 11, cost: 0.0325761
11/03/21 - 11:05.09PM, iter: 12, cost: 0.0320624

I'm saving the gradients and weights each epoch; whenever the cost increases I restore the last checkpoint and decrease the learning rate by a decay factor. Seems to work.

rpvelloso commented 2 years ago

Another change I've made: switching from double to single precision resulted in a major speedup. Each epoch took 19 min; now it's about 9 min on the same machine.

rpvelloso commented 2 years ago

I also rewrote it in C++/STL to get rid of the mallocs and possible memory leaks.

rpvelloso commented 2 years ago

Pushed my changes to my own cloned repo https://github.com/rpvelloso/GloVe : 1) vocab_count and glove rewritten in C++ (will rewrite the other modules later); 2) glove runs twice as fast with single-precision math; 3) rolls back an epoch if the cost increases and decays eta.

closing.