ZQS1943 closed this issue 4 years ago.
So sorry. I manually deleted all the checkpoints and ran the training again, and the loss dropped. I don't know why.
I recently ran into a similar problem. Looking at your logs, it might be the same issue: when the checkpoint from the first step (model_checkpoint-00000001) is loaded, the model just won't optimize. Later checkpoints work fine, though. This is really weird...
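To be concrete, by "loaded" I mean roughly the usual PyTorch-style restore pattern below. This is only a simplified sketch: the model/optimizer objects, the file name, and the checkpoint keys ("model", "optimizer", "step") are placeholders, not this repo's actual code.

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer just to make the sketch runnable; not this repo's classes.
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters())

checkpoint = torch.load("model_checkpoint-00000001.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])          # restore weights
optimizer.load_state_dict(checkpoint["optimizer"])  # restore optimizer state as well
start_step = checkpoint.get("step", 0)              # resume from the saved step
```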
Also, I modified the code to support multi-GPU training, and when I set batch accumulation to 1 (no accumulation, i.e. one big batch of 24 instead of 4 accumulations of 6), the model won't optimize. Setting it to 6x4 or 12x2 works.
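For clarity, by "batch accumulation" I mean the standard gradient-accumulation loop, roughly like this minimal PyTorch-style sketch (dummy model, loss, and data just to make it runnable; none of these names come from the repo):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the repo's model and data loader.
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = [(torch.randn(6, 16), torch.randn(6, 1)) for _ in range(24)]  # micro-batches of 6

accum_steps = 4  # 4 micro-batches of 6 ~ one effective batch of 24
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    # Scale the loss so the accumulated gradient matches the average over the big batch.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```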
I'm tearing my hair out trying to find the cause of this issue, but to no avail.
Hi,
I'm trying to run your model, but the loss does not drop during training.
Here is part of the loss log:
The loss remains unchanged across the two eval_on_train runs, as if the model is not being updated.
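In case it is useful, a quick way to confirm this would be to snapshot the parameters before an optimizer step and compare afterwards (a hypothetical PyTorch-style check, not code from this repo; `model`, `optimizer`, and `loss` in the usage comment are placeholders):

```python
import torch

def params_changed(model: torch.nn.Module, before: dict) -> bool:
    """Return True if any parameter differs from the snapshot taken before the step."""
    return any(not torch.equal(p.detach().cpu(), before[name])
               for name, p in model.named_parameters())

# Usage sketch around one training step:
# before = {name: p.detach().cpu().clone() for name, p in model.named_parameters()}
# loss.backward(); optimizer.step()
# print("weights updated:", params_changed(model, before))
```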
I did not use Docker; I installed the related dependencies manually instead. Could this be the cause of the error? Thank you!