Closed: DevanshChoubey closed this issue 4 years ago.
Try bs=3 and num_batch_accumulated=7, or bs=2 and num_batch_accumulated=8. With 2 and 8, training is faster, but the maximum accuracy I'm getting after 52k steps is only 59.6%, so I would suggest you go with 3 and 7. Please let me know the results after you try it.
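For anyone unfamiliar with these knobs: `num_batch_accumulated` trades memory for speed by accumulating gradients over several small batches before each optimizer step, so the effective batch size is `bs * num_batch_accumulated` (21 or 16 above) while only `bs` examples are held in GPU memory at once. A minimal sketch of the idea, using a toy linear model rather than the actual RAT-SQL training loop:

```python
import torch
import torch.nn as nn

bs, num_batch_accumulated = 3, 7  # effective batch size = 21

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy data standing in for real training batches.
data = torch.randn(bs * num_batch_accumulated, 10)
target = torch.randn(bs * num_batch_accumulated, 1)

opt.zero_grad()
for i in range(num_batch_accumulated):
    x = data[i * bs:(i + 1) * bs]
    y = target[i * bs:(i + 1) * bs]
    loss = nn.functional.mse_loss(model(x), y)
    # Scale the loss so the accumulated gradient matches what one
    # large batch of bs * num_batch_accumulated would have produced.
    (loss / num_batch_accumulated).backward()
opt.step()  # a single update for the whole effective batch
```

Only one small batch's activations live on the GPU at a time, which is why raising `num_batch_accumulated` while lowering `bs` helps with the OOM below.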
Hi @karthikj11,
Thanks for the response; I will definitely try the new values.
P.S. I was able to train the model with the default values on a P100 by restarting from the checkpoint again and again, but after 1000 steps I abandoned it, because the gap between my loss and the loss log file provided by @alexpolozov was tremendous.
Hi, I tried training the model on a P5000 and on a V100 with 16 GB of memory, and I still got this error: after 100 steps it runs out of memory with the current config.
```
[2020-07-26T09:46:17] Logging to logdir/bert_run/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1
Loading model from logdir/bert_run/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint
[2020-07-26T09:46:46] Step 100 stats, train: loss = 157.97793579101562
[2020-07-26T09:46:54] Step 100 stats, val: loss = 187.46903228759766
[2020-07-26T09:47:08] Step 100: loss=180.5266
Traceback (most recent call last):
  File "run.py", line 109, in <module>
    main()
  File "run.py", line 77, in main
    train.main(train_config)
  File "/notebooks/rat-sql/ratsql/commands/train.py", line 274, in main
    trainer.train(config, modeldir=args.logdir)
  File "/notebooks/rat-sql/ratsql/commands/train.py", line 192, in train
    norm_loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 14.80 GiB already allocated; 3.50 MiB free; 533.26 MiB cached)
```
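For what it's worth, here is a small sketch (my own debugging aid, not part of the RAT-SQL codebase) for inspecting the CUDA caching allocator when chasing an OOM like this. Note that `empty_cache()` only returns cached-but-unused blocks (the "533.26 MiB cached" in the message) to the driver; it cannot recover the 14.80 GiB held by live tensors, which is why shrinking `bs` is the real fix:

```python
import torch

# Sketch only: inspect allocator state when debugging CUDA OOM.
# Guarded so it also runs on CPU-only machines.
if torch.cuda.is_available():
    gib = 2 ** 30
    print(f"allocated: {torch.cuda.memory_allocated() / gib:.2f} GiB")
    print(f"reserved (cached): {torch.cuda.memory_reserved() / gib:.2f} GiB")
    # Releases cached blocks back to the driver; live tensors are untouched.
    torch.cuda.empty_cache()
```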