sogou / SogouMRCToolkit

This toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.

Running run_bert_coqa.py causes OOM #21

Closed yanchlu closed 5 years ago

yanchlu commented 5 years ago

I have used 3 GPUs to run this program, but it still fails with an OOM error:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[12,12,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node bert/encoder/layer_6/attention/self/Softmax (defined at /data2/wangfuyu/NQ/ycl/SMRCToolkit-master/sogou_mrc/libraries/modeling.py:728) = Softmax[T=DT_FLOAT, _class=["loc:@bert/encoder/layer_6/attention/self/cond/Switch_1"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_6/attention/self/add)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[{{node truediv/_771}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5803_truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
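For context on why this allocation fails: the tensor in the error is the self-attention score tensor, whose shape [12, 12, 512, 512] corresponds to batch_size × num_heads × seq_len × seq_len for BERT-base with 512-token inputs. A minimal back-of-the-envelope sketch (plain Python, not part of the toolkit) of the memory that one such tensor needs per layer:

```python
# Rough memory estimate for one BERT self-attention score tensor.
# Shape taken from the error message: [12, 12, 512, 512], dtype float32.
batch_size = 12      # per-GPU batch size in the failing run (first dim of the shape)
num_heads = 12       # BERT-base attention heads
seq_len = 512        # maximum sequence length

bytes_per_float = 4  # DT_FLOAT is 32-bit
attn_scores_bytes = batch_size * num_heads * seq_len * seq_len * bytes_per_float
print(f"{attn_scores_bytes / 1024**2:.0f} MiB per attention-score tensor")  # ~144 MiB

# BERT-base has 12 encoder layers, and training keeps these activations
# (plus the softmax outputs and their gradients) alive for backprop, so
# the attention-related footprint alone reaches several GiB before counting
# embeddings, feed-forward activations, and optimizer state.
```

Because this cost is linear in batch_size but quadratic in seq_len, reducing either directly shrinks the failing allocation.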

yanchlu commented 5 years ago

Is recv_device="/job:localhost/replica:0/task:0/device:CPU:0" the cause of the OOM?

bigcat2333 commented 5 years ago

Hello, I ran into the same problem. How did you solve it in the end?

yanchlu commented 5 years ago

Make the batch_size smaller.
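For anyone else hitting this: the fix above means lowering the per-GPU batch size where the training batch generator is built in run_bert_coqa.py. A rough sketch of the kind of change involved, with assumed import path and argument names, since the exact script may differ:

```python
# Hypothetical excerpt only; the real constructor names/arguments in
# run_bert_coqa.py may differ from what is shown here.
from sogou_mrc.data.batch_generator import BatchGenerator  # assumed import path

# Before: batch_size=12 per GPU produced the ~144 MiB attention tensors above.
# After: a smaller batch_size (e.g. 4 or 6) lowers peak activation memory
# roughly proportionally, at the cost of noisier gradients.
train_batch_generator = BatchGenerator(
    vocab,          # vocabulary built from the CoQA training data
    train_data,     # parsed CoQA training examples
    training=True,
    batch_size=4,   # reduced from the default to avoid OOM
)
# Alternatively, shortening the maximum sequence length (512 in the error)
# reduces attention memory quadratically, if truncation is acceptable.
```

If the smaller batch hurts convergence, accumulating gradients over several small batches is the usual way to recover the effective batch size, though the stock script may not support that out of the box.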

deepaknlp commented 5 years ago

Make the batch_size smaller.

Hey @yanchlu, were you able to solve the issue?

yanchlu commented 5 years ago

Make the batch_size smaller.

Hey @yanchlu, were you able to solve the issue?

I reduced the batch_size, but the results were not satisfactory.

deepaknlp commented 5 years ago

@yanchlu Thank you for the quick response.