google-research / lasertagger


Problem when training using GPU #18

Closed · tranthuykieu closed this issue 3 years ago

tranthuykieu commented 3 years ago

I have tried to train the model on both CPU and GPU. It works well on CPU, but I get the following error when running on GPU:

2021-05-03 09:56:44.869207: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *****__*
2021-05-03 09:56:44.869262: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[32768,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.

(0) Resource exhausted: OOM when allocating tensor with shape[32768,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node bert/encoder/layer_0/intermediate/dense/MatMul (defined at /home/kieuttt/anaconda3/envs/lasertagger/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  [[loss/Mean/_4031]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[32768,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node bert/encoder/layer_0/intermediate/dense/MatMul (defined at /home/kieuttt/anaconda3/envs/lasertagger/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Has anyone else run into this, or does anyone know how to fix it? Thank you.
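For reference, the `report_tensor_allocations_upon_oom` hint in the log refers to TF 1.x's `RunOptions`. A minimal, self-contained sketch of how that option is used is below; it is purely illustrative (a stand-in graph with the same tensor shape as in the error), not LaserTagger's actual training loop:

```python
import tensorflow as tf  # TF 1.x, matching the stack trace above

# Ask TensorFlow to report the allocated tensors if an OOM happens in this run call.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Stand-in graph that allocates a tensor with the same shape as in the error message.
a = tf.random.normal([32768, 3072])
b = tf.random.normal([3072, 3072])
product = tf.matmul(a, b)

with tf.Session() as sess:
    sess.run(product, options=run_options)
```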

ekQ commented 3 years ago

I would try reducing the batch size to something smaller (e.g. 2) to see if you still get OOM.
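For example, assuming your invocation follows the README's `run_lasertagger.py` training command (exact flags and paths depend on your setup; the paths below are placeholders), you could pass a much smaller `--train_batch_size` and retry:

```bash
# Same training command as in the README, but with a small batch size to test for OOM.
# Paths and environment variables are placeholders for your own setup.
python run_lasertagger.py \
  --training_file="${OUTPUT_DIR}/train.tf_record" \
  --eval_file="${OUTPUT_DIR}/tune.tf_record" \
  --label_map_file="${OUTPUT_DIR}/label_map.txt" \
  --model_config_file="${CONFIG_FILE}" \
  --init_checkpoint="${BERT_BASE_DIR}/bert_model.ckpt" \
  --output_dir="${OUTPUT_DIR}/models/my_experiment" \
  --do_train=true \
  --do_eval=true \
  --train_batch_size=2
```

If a batch size of 2 trains without OOM, you can increase it gradually until you find the largest value your GPU memory allows.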

tranthuykieu commented 3 years ago

Thanks ekQ, with a smaller batch size as you suggested, I don't get the OOM error anymore.