tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

GPU not fully utilized, yet OOM error #362

Open 4pal opened 6 years ago

4pal commented 6 years ago

Here are the details. I use a p2.xlarge instance with the following configuration:

TensorFlow version 1.4.1, CUDA version 8.0, cuDNN version 6, 1 GPU with 11 GB of memory. I don't understand why I get the error below. I have more memory than required, yet I get an out-of-memory error.

```
2018-07-17 13:25:08.653662: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit:        11332668621
InUse:          180233984
MaxInUse:       269354496
NumAllocs:          21256
MaxAllocSize:    51247104
```

```
2018-07-17 13:25:08.653685: W tensorflow/core/common_runtime/bfc_allocator.cc:277] *****__***xxx****____**xx
2018-07-17 13:25:08.653711: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[86784,34820]
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/ubuntu/deepqa/nmt.py", line 605, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/deepqa/nmt.py", line 598, in main
    run_main(FLAGS, default_hparams, train_fn, inference_fn)
  File "/home/ubuntu/deepqa/nmt.py", line 591, in run_main
    train_fn(hparams, target_session=target_session)
  File "deepqa/train.py", line 339, in train
    sample_tgt_data, avg_ckpts)
  File "deepqa/train.py", line 166, in run_full_eval
    eval_model, eval_sess, model_dir, hparams, summary_writer)
  File "deepqa/train.py", line 72, in run_internal_eval
    summary_writer, "dev")
  File "deepqa/train.py", line 497, in _internal_eval
    ppl = model_helper.compute_perplexity(model, sess, label)
  File "deepqa/model_helper.py", line 597, in compute_perplexity
    loss, predict_count, batch_size = model.eval(sess)
  File "deepqa/model.py", line 272, in eval
    self.batch_size])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[86784,34820]
   [[Node: dynamic_seq2seq/decoder/output_projection/Tensordot/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dynamic_seq2seq/decoder/output_projection/Tensordot/Reshape, dynamic_seq2seq/decoder/output_projection/Tensordot/Reshape_1)]]
   [[Node: dynamic_seq2seq/truediv/_137 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_471_dynamic_seq2seq/truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
```

```
Caused by op u'dynamic_seq2seq/decoder/output_projection/Tensordot/MatMul', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/ubuntu/deepqa/nmt.py", line 605, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/deepqa/nmt.py", line 598, in main
    run_main(FLAGS, default_hparams, train_fn, inference_fn)
  File "/home/ubuntu/deepqa/nmt.py", line 591, in run_main
    train_fn(hparams, target_session=target_session)
  File "deepqa/train.py", line 297, in train
    eval_model = model_helper.create_eval_model(model_creator, hparams, scope)
  File "deepqa/model_helper.py", line 162, in create_eval_model
    extra_args=extra_args)
  File "deepqa/model.py", line 109, in __init__
    res = self.build_graph(hparams, scope=scope)
  File "deepqa/model.py", line 303, in build_graph
    encoder_outputs, encoder_state, hparams)
  File "deepqa/model.py", line 421, in _build_decoder
    logits = self.output_layer(outputs.rnn_output)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 575, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/core.py", line 156, in call
    [0]])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 2520, in tensordot
    ab_matmul = matmul(a_reshape, b_reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1891, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2437, in _mat_mul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
```

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[86784,34820]
   [[Node: dynamic_seq2seq/decoder/output_projection/Tensordot/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dynamic_seq2seq/decoder/output_projection/Tensordot/Reshape, dynamic_seq2seq/decoder/output_projection/Tensordot/Reshape_1)]]
   [[Node: dynamic_seq2seq/truediv/_137 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_471_dynamic_seq2seq/truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
```
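For what it's worth, quick arithmetic on the shape in the error message explains the OOM even though the allocator reports little memory in use: a single float32 tensor of shape [86784, 34820] is by itself larger than the 11.3 GB limit, so free memory elsewhere doesn't help until that tensor shrinks. A minimal check:

```python
# Size of the tensor TensorFlow failed to allocate, from the OOM message.
rows, cols = 86784, 34820      # likely batch_size * decoder steps, vocab size
bytes_per_float32 = 4

bytes_needed = rows * cols * bytes_per_float32
limit = 11332668621            # "Limit" reported in the bfc_allocator stats

print("tensor needs %.2f GB, GPU limit is %.2f GB"
      % (bytes_needed / 1e9, limit / 1e9))
# tensor needs 12.09 GB, GPU limit is 11.33 GB
```

Both dimensions scale with user-controllable settings (batch size and vocabulary size), which is why shrinking either one makes the allocation fit.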

4pal commented 6 years ago

Solved by reducing the batch size to 64.

ghost commented 5 years ago

I have tried both reducing the batch_size default to 64 (line 201) in nmt.py and passing the flag batch_size=64. In both cases, I still get batch_size=128 in my hparams file. What am I doing wrong?

4pal commented 5 years ago

Update your hparams file and save it. There are two hparams files; make sure all are updated with the same parameters.
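For anyone hitting the same stale-hparams problem: the saved hparams file in the output directory overrides command-line flags when a run is restarted. A minimal sketch of editing it in place (file name and location are assumptions; in the nmt scripts it is typically a JSON file under your --out_dir):

```python
import json
import os
import tempfile

def update_hparams(path, **overrides):
    """Load a JSON hparams file, override values, and save it back."""
    with open(path) as f:
        hparams = json.load(f)
    hparams.update(overrides)
    with open(path, "w") as f:
        json.dump(hparams, f, indent=2)
    return hparams

# Demo on a throwaway file; in practice point this at <out_dir>/hparams.
path = os.path.join(tempfile.mkdtemp(), "hparams")
with open(path, "w") as f:
    json.dump({"batch_size": 128, "num_units": 512}, f)  # stale saved copy

updated = update_hparams(path, batch_size=64)
print(updated["batch_size"])  # 64
```

Delete or edit every saved copy (and any checkpointed hparams) before restarting, or the old batch_size keeps coming back.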

4pal commented 5 years ago

Use nvidia-smi to monitor GPU utilization, and adjust the batch size until it fits into memory, i.e. 128, 64, 32, 16. You can also use TextRank to summarize (shorten) your data. Which GPU are you using: Tesla K80, P100, V100, etc.?
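The halving strategy (128, 64, 32, 16, ...) can be sketched as a retry loop. Everything named here is hypothetical scaffolding: train_step stands in for one training step that raises an OOM-style exception (in TF 1.x that would be tf.errors.ResourceExhaustedError) when the batch does not fit:

```python
def find_fitting_batch_size(train_step, start=128, floor=8):
    """Halve the batch size until one training step fits in GPU memory."""
    batch_size = start
    while batch_size >= floor:
        try:
            train_step(batch_size)   # raises on OOM
            return batch_size
        except MemoryError:          # stand-in for tf.errors.ResourceExhaustedError
            batch_size //= 2         # 128 -> 64 -> 32 -> 16 ...
    raise RuntimeError("no batch size >= %d fits in memory" % floor)

# Fake step that only "fits" at 32 or below, to show the search order.
def fake_step(bs):
    if bs > 32:
        raise MemoryError("OOM")

print(find_fitting_batch_size(fake_step))  # 32
```

In a real run you would do this manually (edit the flag, restart, watch nvidia-smi) rather than in-process, since TF 1.x does not reliably free GPU memory after an OOM.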

ghost commented 5 years ago

I have a single NVIDIA GTX 1080 Ti with 11 GB. I had to delete the entire contents of the nmt-model folder before running it a second time for the changes to take effect. But even with a batch size of 32 it runs out of memory. My vocab size is 100k. I will reduce it to 30k and check again.
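The vocabulary cut should help a lot here: the failing tensor is the decoder's logits, with shape [batch × time, vocab], so its size scales linearly in vocabulary size. A rough estimate, assuming float32 logits and the same 86784 flattened batch × time rows as in the OOM message above:

```python
def logits_bytes(batch_times_steps, vocab_size, bytes_per_elem=4):
    # float32 logits produced by the decoder's output projection
    return batch_times_steps * vocab_size * bytes_per_elem

rows = 86784  # flattened batch * decoder steps, from the OOM message
print(logits_bytes(rows, 100000) / 1e9)  # 34.7136  (GB at 100k vocab)
print(logits_bytes(rows, 30000) / 1e9)   # 10.41408 (GB at 30k vocab)
```

So at a 100k vocabulary this one tensor could never fit in 11 GB at that batch/length; at 30k it is borderline, which matches the need to also shrink the batch.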

4pal commented 5 years ago

Do you have train, dev, and validation datasets? Then you need to reduce the size of the dev and validation datasets, and also adjust their batch sizes. Memory is also used during that evaluation process.

4pal commented 5 years ago

If you have little memory, you may try https://github.com/Kyubyong/transformer. NMT usually requires a lot of resources in terms of memory, which is expensive. That model uses attention mechanisms, is fast, and you don't have to have three sets of data, which is memory-intensive. The validation set is not used during training, only for inference later.