Hmm, it looks like your GPU is running out of memory when it enters evaluation.
This hasn't happened to me before, but I've only run this on TITAN X's with 12 gigs of memory.
Can you give me the output of `nvidia-smi`?
Have you tried reducing the `batch_size` option in your config? If the failure only shows up at evaluation time, two things usually help: making sure decoding runs under `torch.no_grad()` so no activations are kept for backprop, and decoding the test set in smaller slices. A minimal sketch of the idea is below.
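The function and argument names in this sketch are made up for illustration; it is not this repo's actual `decode_dataset`/`decode_minibatch` code.

```python
import torch

def decode_in_chunks(model, eval_batches, max_chunk=8):
    """Hypothetical helper: decode the evaluation set in small slices
    under no_grad so activations are never kept for backprop."""
    model.eval()
    outputs = []
    with torch.no_grad():  # don't build the autograd graph during decoding
        for batch in eval_batches:
            # split each evaluation batch into smaller pieces if needed
            for start in range(0, batch.size(0), max_chunk):
                chunk = batch[start:start + max_chunk]
                outputs.append(model(chunk))
    return outputs
```

The `256/500...` progress line in the log below also suggests fairly large evaluation batches, so lowering `batch_size` in the config is the quickest thing to try.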
Closing due to inactivity.
2019-08-13 16:18:25,063 - INFO - MODEL HAS 9181445 params
2019-08-13 16:18:25,894 - INFO - EPOCH: 0 ITER: 0.0/692.2578125 WPS: 111470.56 LOSS: 9.1706 METRIC: 0.0000
2019-08-13 16:19:05,404 - INFO - EPOCH: 0 ITER: 200.0/692.2578125 WPS: 1296.01 LOSS: 5.8122 METRIC: 0.0000
2019-08-13 16:19:44,576 - INFO - EPOCH: 0 ITER: 400.0/692.2578125 WPS: 1307.14 LOSS: 5.0753 METRIC: 0.0000
2019-08-13 16:20:23,774 - INFO - EPOCH: 0 ITER: 600.0/692.2578125 WPS: 1306.30 LOSS: 4.8040 METRIC: 0.0000
2019-08-13 16:20:41,728 - INFO - EPOCH 0 COMPLETE. EVALUATING...
256/500...
2019-08-13 16:20:42,080 - INFO - METRIC: 5.795916557312012. TIME: 0.35s
CHECKPOINTING...
2019-08-13 16:20:42,339 - INFO - EPOCH: 1 ITER: 0.0/692.2578125 WPS: 2758.25 LOSS: 4.5814 METRIC: 5.7959
2019-08-13 16:21:21,594 - INFO - EPOCH: 1 ITER: 200.0/692.2578125 WPS: 1304.40 LOSS: 4.4112 METRIC: 5.7959
2019-08-13 16:22:00,776 - INFO - EPOCH: 1 ITER: 400.0/692.2578125 WPS: 1306.80 LOSS: 4.1710 METRIC: 5.7959
2019-08-13 16:22:39,949 - INFO - EPOCH: 1 ITER: 600.0/692.2578125 WPS: 1307.13 LOSS: 3.9828 METRIC: 5.7959
2019-08-13 16:22:57,935 - INFO - EPOCH 1 COMPLETE. EVALUATING...
0/500...
Traceback (most recent call last):
File "/home/chenzhanghui/.pycharm_helpers/pydev/pydevd.py", line 1741, in <module>
main()
File "/home/chenzhanghui/.pycharm_helpers/pydev/pydevd.py", line 1735, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/home/chenzhanghui/.pycharm_helpers/pydev/pydevd.py", line 1135, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/chenzhanghui/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/chenzhanghui/code/delete_retrieve_generate/train.py", line 203, in
model, src_test, tgt_test, config)
File "/home/chenzhanghui/code/delete_retrieve_generate/src/evaluation.py", line 147, in inference_metrics
model, src, tgt, config)
File "/home/chenzhanghui/code/delete_retrieve_generate/src/evaluation.py", line 111, in decode_dataset
input_ids_aux, auxlens, auxmask)
File "/home/chenzhanghui/code/delete_retrieve_generate/src/evaluation.py", line 76, in decode_minibatch
aux_input, auxmask, auxlens)
File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, *kwargs)
File "/home/chenzhanghui/code/delete_retrieve_generate/src/models.py", line 154, in forward
decoder_logit = self.output_projection(tgt_outputs_reshape)
File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(input, **kwargs)
File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 67, in forward
return F.linear(input, self.weight, self.bias)
File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 1352, in linear
ret = torch.addmm(torch.jit._unwrap_optional(bias), input, weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 431.50 MiB (GPU 0; 10.92 GiB total capacity; 1.65 GiB already allocated; 51.50 MiB free; 365.57 MiB cached)
When I run with the "delete" `model_type`, I encounter this problem.
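For what it's worth, the error itself is informative: only 51.50 MiB of a 10.92 GiB card is free even though PyTorch has allocated just ~1.65 GiB, which usually means something else is holding most of the card (hence the `nvidia-smi` request above). A quick way to check from inside the script is the hedged snippet below; it is not part of this repo, and uses the memory-query functions available in the torch 1.0.x-era install shown in the traceback.

```python
import torch

# Hypothetical diagnostic (not repo code): print what PyTorch itself holds on
# the GPU, to compare against nvidia-smi. If nvidia-smi reports far more memory
# in use than these numbers, another process is occupying the card.
if torch.cuda.is_available():
    device = torch.device('cuda:0')
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)  # bytes held by live tensors
    cached = torch.cuda.memory_cached(device)        # bytes cached by the allocator
    print('total:     %.2f GiB' % (total / 1024 ** 3))
    print('allocated: %.2f GiB' % (allocated / 1024 ** 3))
    print('cached:    %.2f GiB' % (cached / 1024 ** 3))
```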