sogou / SogouMRCToolkit

This toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.

Running run_bert_coqa.py causes OOM #21

Closed yanchlu closed 5 years ago

yanchlu commented 5 years ago

I have used 3 GPUs to run this program, but it still fails with an OOM error:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[12,12,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node bert/encoder/layer_6/attention/self/Softmax (defined at /data2/wangfuyu/NQ/ycl/SMRCToolkit-master/sogou_mrc/libraries/modeling.py:728) = Softmax[T=DT_FLOAT, _class=["loc:@bert/encoder/layer_6/attention/self/cond/Switch_1"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_6/attention/self/add)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node truediv/_771}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5803_truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
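For context on why this allocation fails: the tensor in the error is the self-attention score tensor, whose shape [12, 12, 512, 512] corresponds to batch_size × num_heads × seq_len × seq_len for BERT-base with 512-token inputs. A minimal back-of-the-envelope sketch (plain Python, not part of the toolkit) of the memory that one such tensor needs per layer:

```python
# Rough memory estimate for one BERT self-attention score tensor.
# Shape taken from the error message: [12, 12, 512, 512], dtype float32.
batch_size = 12      # per-GPU batch size in the failing run (first dim of the shape)
num_heads = 12       # BERT-base attention heads
seq_len = 512        # maximum sequence length

bytes_per_float = 4  # DT_FLOAT is 32-bit
attn_scores_bytes = batch_size * num_heads * seq_len * seq_len * bytes_per_float
print(f"{attn_scores_bytes / 1024**2:.0f} MiB per attention-score tensor")  # ~144 MiB

# BERT-base has 12 encoder layers, and training keeps these activations
# (plus the softmax outputs and their gradients) alive for backprop, so
# the attention-related footprint alone reaches several GiB before counting
# embeddings, feed-forward activations, and optimizer state.
```

Because this cost is linear in batch_size but quadratic in seq_len, reducing either directly shrinks the failing allocation.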

yanchlu commented 5 years ago

Is recv_device="/job:localhost/replica:0/task:0/device:CPU:0" the cause of the OOM?

bigcat2333 commented 5 years ago

Hello, I ran into the same problem. How did you solve it in the end?

yanchlu commented 5 years ago

Make the batch_size smaller.
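For anyone else hitting this: the fix above means lowering the per-GPU batch size where the training batch generator is built in run_bert_coqa.py. A rough sketch of the kind of change involved, with assumed import path and argument names, since the exact script may differ:

```python
# Hypothetical excerpt only; the real constructor names/arguments in
# run_bert_coqa.py may differ from what is shown here.
from sogou_mrc.data.batch_generator import BatchGenerator  # assumed import path

# Before: batch_size=12 per GPU produced the ~144 MiB attention tensors above.
# After: a smaller batch_size (e.g. 4 or 6) lowers peak activation memory
# roughly proportionally, at the cost of noisier gradients.
train_batch_generator = BatchGenerator(
    vocab,          # vocabulary built from the CoQA training data
    train_data,     # parsed CoQA training examples
    training=True,
    batch_size=4,   # reduced from the default to avoid OOM
)
# Alternatively, shortening the maximum sequence length (512 in the error)
# reduces attention memory quadratically, if truncation is acceptable.
```

If the smaller batch hurts convergence, accumulating gradients over several small batches is the usual way to recover the effective batch size, though the stock script may not support that out of the box.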

deepaknlp commented 5 years ago

Make the batch_size smaller.

Hey @yanchlu, were you able to solve the issue?

yanchlu commented 5 years ago

Make the batch_size smaller.

Hey @yanchlu, were you able to solve the issue?

I reduced the batch_size, but the results were not satisfactory.

deepaknlp commented 5 years ago

@yanchlu Thank you for the quick response.