rpryzant / delete_retrieve_generate

PyTorch implementation of the Delete, Retrieve Generate style transfer algorithm
MIT License

CUDA out of memory #10

Closed czhxiaohuihui closed 5 years ago

czhxiaohuihui commented 5 years ago

2019-08-13 16:18:25,063 - INFO - MODEL HAS 9181445 params
2019-08-13 16:18:25,894 - INFO - EPOCH: 0 ITER: 0.0/692.2578125 WPS: 111470.56 LOSS: 9.1706 METRIC: 0.0000
2019-08-13 16:19:05,404 - INFO - EPOCH: 0 ITER: 200.0/692.2578125 WPS: 1296.01 LOSS: 5.8122 METRIC: 0.0000
2019-08-13 16:19:44,576 - INFO - EPOCH: 0 ITER: 400.0/692.2578125 WPS: 1307.14 LOSS: 5.0753 METRIC: 0.0000
2019-08-13 16:20:23,774 - INFO - EPOCH: 0 ITER: 600.0/692.2578125 WPS: 1306.30 LOSS: 4.8040 METRIC: 0.0000
2019-08-13 16:20:41,728 - INFO - EPOCH 0 COMPLETE. EVALUATING...
256/500...2019-08-13 16:20:42,080 - INFO - METRIC: 5.795916557312012. TIME: 0.35s CHECKPOINTING...
2019-08-13 16:20:42,339 - INFO - EPOCH: 1 ITER: 0.0/692.2578125 WPS: 2758.25 LOSS: 4.5814 METRIC: 5.7959
2019-08-13 16:21:21,594 - INFO - EPOCH: 1 ITER: 200.0/692.2578125 WPS: 1304.40 LOSS: 4.4112 METRIC: 5.7959
2019-08-13 16:22:00,776 - INFO - EPOCH: 1 ITER: 400.0/692.2578125 WPS: 1306.80 LOSS: 4.1710 METRIC: 5.7959
2019-08-13 16:22:39,949 - INFO - EPOCH: 1 ITER: 600.0/692.2578125 WPS: 1307.13 LOSS: 3.9828 METRIC: 5.7959
2019-08-13 16:22:57,935 - INFO - EPOCH 1 COMPLETE. EVALUATING...
0/500...Traceback (most recent call last):
  File "/home/chenzhanghui/.pycharm_helpers/pydev/pydevd.py", line 1741, in <module>
    main()
  File "/home/chenzhanghui/.pycharm_helpers/pydev/pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/chenzhanghui/.pycharm_helpers/pydev/pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/chenzhanghui/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/chenzhanghui/code/delete_retrieve_generate/train.py", line 203, in <module>
    model, src_test, tgt_test, config)
  File "/home/chenzhanghui/code/delete_retrieve_generate/src/evaluation.py", line 147, in inference_metrics
    model, src, tgt, config)
  File "/home/chenzhanghui/code/delete_retrieve_generate/src/evaluation.py", line 111, in decode_dataset
    input_ids_aux, auxlens, auxmask)
  File "/home/chenzhanghui/code/delete_retrieve_generate/src/evaluation.py", line 76, in decode_minibatch
    aux_input, auxmask, auxlens)
  File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenzhanghui/code/delete_retrieve_generate/src/models.py", line 154, in forward
    decoder_logit = self.output_projection(tgt_outputs_reshape)
  File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 67, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/chenzhanghui/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 1352, in linear
    ret = torch.addmm(torch.jit._unwrap_optional(bias), input, weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 431.50 MiB (GPU 0; 10.92 GiB total capacity; 1.65 GiB already allocated; 51.50 MiB free; 365.57 MiB cached)

I encounter this problem when I run with model_type set to "delete".

rpryzant commented 5 years ago

Hmm, it looks like your GPU is running out of memory when it enters evaluation.
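
For a rough sense of scale: the allocation that fails in the traceback is the output projection in models.py, and a logits tensor of that kind grows with batch size × sequence length × vocabulary size. Here is a minimal sketch of that arithmetic, assuming float32 values. The batch size, sequence length, and vocabulary size are illustrative numbers, not values read from the config in this issue.

```python
def logits_mib(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Approximate size in MiB of a (batch_size * seq_len, vocab_size) tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 2**20


# Hypothetical sizes, for illustration only.
full = logits_mib(batch_size=256, seq_len=50, vocab_size=10000)
half = logits_mib(batch_size=128, seq_len=50, vocab_size=10000)
print(f"{full:.1f} MiB at batch 256, {half:.1f} MiB at batch 128")
```

Because the estimate is linear in batch size, halving the batch should roughly halve this one allocation.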

This hasn't happened to me before, but I've only run this on TITAN X's with 12 gigs of memory.

Can you give me the output of nvidia-smi?
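
The error reports 1.65 GiB allocated but only 51.50 MiB free out of 10.92 GiB, so something else may be holding memory on that GPU. If the full nvidia-smi table is awkward to paste, a small sketch like the one below can summarize per-GPU memory instead. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV mode. The helper names are made up for this example, the sample numbers are invented, and "free" here is simply total minus used.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"gpu": index, "used_mib": used, "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and parse its output; needs an NVIDIA driver installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Sample text in the shape this query prints (the numbers are made up).
sample = "0, 11120, 11178\n1, 10, 11178\n"
for row in parse_gpu_memory(sample):
    print(row)
```

On a machine with a GPU, calling `query_gpu_memory()` before training starts shows how much memory is already taken by other processes.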

Have you tried reducing the batch_size option in your config?
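
As a sketch of what reducing batch_size could look like, assuming the config is a JSON file: the snippet below halves every `batch_size` entry it finds, wherever it is nested. The layout shown is hypothetical and may not match the config files in this repo, and `halve_batch_size` is a helper written for this example.

```python
import json


def halve_batch_size(config, minimum=1):
    """Halve every 'batch_size' entry found anywhere in a nested config dict."""
    for key, value in config.items():
        if isinstance(value, dict):
            halve_batch_size(value, minimum)
        elif key == "batch_size":
            config[key] = max(minimum, value // 2)
    return config


# Hypothetical config layout; the real file may nest batch_size differently.
config = json.loads('{"data": {"batch_size": 256}, "training": {"epochs": 20}}')
print(json.dumps(halve_batch_size(config)))
```

Writing the result back with `json.dump` and rerunning training would show whether evaluation then fits in memory.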

rpryzant commented 5 years ago

Closing due to inactivity.