google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

OOM error fine-tuning #879

Open ecatkins opened 4 years ago

ecatkins commented 4 years ago

When trying to fine-tune BERT on a classification task (run_classifier.py) using my own dataset, I am running into an OOM issue with the following output:

2019-10-15 18:21:25.247491: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 8384512 totalling 8.00MiB
2019-10-15 18:21:25.247501: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 262 Chunks of size 16777216 totalling 4.09GiB
2019-10-15 18:21:25.247511: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 16781312 totalling 16.00MiB
2019-10-15 18:21:25.247520: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 20971520 totalling 20.00MiB
2019-10-15 18:21:25.247530: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 118767616 totalling 566.33MiB
2019-10-15 18:21:25.247540: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 128971776 totalling 123.00MiB
2019-10-15 18:21:25.247548: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 6.88GiB
2019-10-15 18:21:25.247560: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats: 
Limit:                  7392346112
InUse:                  7392346112
MaxInUse:               7392346112
NumAllocs:                    2204
MaxAllocSize:            128971776

2019-10-15 18:21:25.247633: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-10-15 18:21:25.247668: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[1024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

(This doesn't break the script; it just keeps running.)

I've tried reducing the batch size from 32 -> 16 -> 4 -> 1, none of which has had any impact. I am using a Tesla P4 with 8 GB. Is my issue as simple as needing more GPU memory, or is there something else going on?
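For reference, a sketch of the kind of run_classifier.py invocation involved, with the flags that matter most for memory called out ($BERT_DIR, $DATA_DIR, the task name, and the output directory are placeholders, not taken from this issue). If I'm reading the repo README's out-of-memory notes right, activation memory is driven mainly by --max_seq_length and --train_batch_size, while the model weights plus Adam's m/v optimizer state are a fixed cost that doesn't shrink with batch size, which would explain why going from 32 down to 1 doesn't make the error go away on a small GPU.

```shell
# Sketch only: $BERT_DIR, $DATA_DIR, MyTask, and the output dir are placeholders.
# --max_seq_length and --train_batch_size are the main activation-memory knobs;
# model weights + Adam optimizer state are a fixed cost on top of that.
python run_classifier.py \
  --task_name=MyTask \
  --do_train=true \
  --do_eval=true \
  --data_dir=$DATA_DIR \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$BERT_DIR/bert_model.ckpt \
  --max_seq_length=64 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/my_task_output/
```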

ecatkins commented 4 years ago

I did solve this by upgrading to a Tesla T4 with 16 GB (batch size is still pretty limited, though). Is this worth a note in the README? I've never had an issue with the P4 before across DL tasks (e.g. I used it to train TensorFlow object detection models), so it might just be worth indicating to people what GPU memory they need to start with.
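For anyone who lands on this later, two quick pre-flight checks help decide whether a given GPU is big enough (the commands below are a sketch; $BERT_DIR is a placeholder for the checkpoint directory, not a path from this thread). The shape[1024,4096] allocation in the log above is consistent with BERT-Large's hidden/intermediate sizes (1024/4096), the most memory-hungry of the released models; BERT-Base (hidden_size 768) fits in considerably less memory.

```shell
# How much GPU memory is actually available? (nvidia-smi ships with the driver.)
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

# Which model size is the checkpoint? hidden_size is 768 for BERT-Base and
# 1024 for BERT-Large. ($BERT_DIR is a placeholder for the checkpoint dir.)
cat $BERT_DIR/bert_config.json
```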