microsoft / rat-sql

A relation-aware semantic parsing model from English to SQL
https://arxiv.org/abs/1911.04942
MIT License

Even 16GB isn't enough??? #9

Closed: DevanshChoubey closed this issue 4 years ago

DevanshChoubey commented 4 years ago

Hi, I tried training the model on a P5000 and a V100, each with 16 GB of memory, and I still got this error: after 100 steps it goes out of memory with the current config.

[2020-07-26T09:46:17] Logging to logdir/bert_run/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1
Loading model from logdir/bert_run/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint
[2020-07-26T09:46:46] Step 100 stats, train: loss = 157.97793579101562
[2020-07-26T09:46:54] Step 100 stats, val: loss = 187.46903228759766
[2020-07-26T09:47:08] Step 100: loss=180.5266
Traceback (most recent call last):
  File "run.py", line 109, in <module>
    main()
  File "run.py", line 77, in main
    train.main(train_config)
  File "/notebooks/rat-sql/ratsql/commands/train.py", line 274, in main
    trainer.train(config, modeldir=args.logdir)
  File "/notebooks/rat-sql/ratsql/commands/train.py", line 192, in train
    norm_loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 14.80 GiB already allocated; 3.50 MiB free; 533.26 MiB cached)
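For reference, PyTorch's built-in memory counters can confirm how close each step runs to the 15.90 GiB cap. A minimal sketch using the standard torch.cuda API (log_gpu_mem is a hypothetical helper, not part of rat-sql):

import torch

def log_gpu_mem(tag):
    # Hypothetical helper, not part of rat-sql: report current and peak
    # tensor allocations in GiB so growth across steps is visible.
    alloc = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={alloc:.2f} GiB, peak={peak:.2f} GiB")

Calling log_gpu_mem(f"step {step}") after each optimizer step shows whether allocation grows steadily across steps or spikes on a particularly long batch; torch.cuda.reset_max_memory_allocated() restarts the peak counter.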

karthikj11 commented 4 years ago

Try bs=3 and num_batch_accumulated=7, or bs=2 and num_batch_accumulated=8. With 2 and 8 training is faster, but the maximum accuracy I'm getting after 52k steps is only 59.6%, so I would suggest going with 3 and 7. Please let me know the results after you try it.
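For anyone wondering why this helps: num_batch_accumulated trades speed for memory via gradient accumulation, so the effective batch size per optimizer update stays comparable (3 × 7 = 21 or 2 × 8 = 16 examples) while peak activation memory scales with bs alone. A minimal sketch of the mechanism (model, loader, and train_steps are hypothetical names, not rat-sql's actual trainer loop, and the forward pass is assumed to return a scalar loss):

import torch

def train_steps(model, loader, optimizer, num_batch_accumulated=7):
    # Gradient accumulation: backward() after every small batch sums
    # gradients into .grad, but the optimizer steps once per group.
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(loader):
        loss = model(batch)
        # Scale each contribution so the accumulated gradient is an average.
        (loss / num_batch_accumulated).backward()
        if (i + 1) % num_batch_accumulated == 0:
            optimizer.step()
            optimizer.zero_grad()

Only one small batch of activations is alive at a time, which is why lowering bs reduces peak memory even though the number of examples per update barely changes.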

DevanshChoubey commented 4 years ago

Hi @karthikj11,

thanks for the response. I will surely try the new values.

P.S. I was able to train the model with the default values on a P100 by restarting from the checkpoint again and again, but after about 1000 steps I abandoned it, as the gap between my loss and the loss log file provided by @alexpolozov was enormous.