Running BERT-Base is generally fine, but unfortunately running BERT-Large fine-tuning on a 12GB GPU is mostly impossible for now. See the "Out-of-memory" section of the README for more information. I plan to implement workarounds that will make tuning BERT-Large on the GPU work out of the box, but I haven't gotten a chance to work on this yet because I'm focused on the multilingual models. For now the workarounds are:
(a) Use BERT-Base now, with plans to switch to BERT-Large later.
(b) Fork the repo and get OpenAI's gradient checkpointing working (and please report back if it works, I haven't tried it yet! A rough sketch follows below.)
(c) Use a Cloud TPU, either paid or for free with Colab (see "Using BERT in Colab" in the README).
(d) Use the feature-based method with BERT-Large (see "Using BERT to extract fixed feature vectors" in the README).
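For option (b), here is a minimal, untested sketch of how OpenAI's gradient checkpointing might be wired in, assuming the `memory_saving_gradients` module from github.com/openai/gradient-checkpointing is on your `PYTHONPATH`. It is not part of this repo; the only idea is that BERT's `optimization.create_optimizer` calls `tf.gradients`, so monkey-patching that function before the graph is built makes activations get recomputed instead of stored.

```python
# Hypothetical sketch of workaround (b): patch tf.gradients with OpenAI's
# gradient checkpointing BEFORE building the training graph. Untested here.
import tensorflow as tf
import memory_saving_gradients  # from github.com/openai/gradient-checkpointing

# optimization.create_optimizer calls tf.gradients(loss, tvars); with this
# patch it will recompute activations during the backward pass, trading
# extra compute for a smaller peak memory footprint.
tf.__dict__["gradients"] = memory_saving_gradients.gradients_memory

# ...then build the estimator / model_fn and run fine-tuning as usual.
```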
Thanks so much for the reply and the timely code/model release. All the suggestions here are quite interesting. Would you also consider multi-GPU training as a workaround for the RAM limit, given that multi-GPU computing resources are more ubiquitously available outside Google?
Multi-GPU training would require using a different (more complex) TensorFlow interface, and we want to stick with TPUEstimator, so if someone wants to implement that they will probably have to fork the repo.
However, the problem is that for reasonably long sequence lengths the max batch size of BERT-Large on a 12GB GPU is 0 to 2 sequences, so even on 4 GPUs you'd still be looking at a batch size of 0 to 8, and batch sizes below 16 will generally degrade results (no matter the learning rate or number of epochs).
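One possible way around the small per-step batch, not implemented in this repo, is gradient accumulation: sum gradients over several micro-batches of 2 sequences and apply a single update, so the effective batch size can still reach 16. A TF1-style sketch with hypothetical names follows; plain `AdamOptimizer` stands in for the repo's `AdamWeightDecayOptimizer`.

```python
import tensorflow as tf

def build_accumulated_train_op(loss, learning_rate, accum_steps=8):
    """Sketch: average gradients over `accum_steps` micro-batches before
    applying one update (e.g. 8 micro-batches of 2 -> effective batch 16)."""
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    pairs = [(g, v) for g, v in zip(grads, tvars) if g is not None]

    # One non-trainable buffer per variable to hold the running gradient sum.
    accum = [tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
             for _, v in pairs]

    # Run every micro-batch: add the new gradients to the buffers.
    accumulate_op = tf.group(
        *[a.assign_add(g) for a, (g, _) in zip(accum, pairs)])

    # Run once every `accum_steps` micro-batches: apply the averaged
    # gradients, then clear the buffers.
    optimizer = tf.train.AdamOptimizer(learning_rate)
    apply_op = optimizer.apply_gradients(
        [(a / accum_steps, v) for a, (_, v) in zip(accum, pairs)])
    with tf.control_dependencies([apply_op]):
        reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum])
    return accumulate_op, reset_op
```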
I agree that, with the huge BERT-Large model, the "data-parallelism" approach isn't the most effective way to scale it across multiple devices. My intuition points towards the "model-parallelism" approach, which might eventually be useful for TPU-based training too (e.g. L>>24, A>>16, H>>1024). But I understand that this might be outside the scope of this repo. Thanks for the discussion anyway.
Yeah, if we scale up beyond BERT-Large, we are probably going to take a fundamentally different modeling approach (i.e., not just train a 64-layer, 2048-dim Transformer), so it will probably be a different repository and a different project name.
Is it possible to fine-tune just a specific number of layers, say the last 2 layers?
@MHDBST I think that is essentially what you do with BERT; they give you a model with pre-trained weights, and you then add the final layers, e.g. a classification layer, to suit your own task, which is why training is much faster.
@BigBadBurrow Thanks. But I meant fine-tuning the top 3 layers of BERT, and then having another non-linear layer on top of that.
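One way to approximate this with the TF code in this repo (a sketch, not an option the scripts expose) is to restrict which variables receive gradients. BERT's encoder variables are scoped as `bert/encoder/layer_N/...`, so keeping only the top few layer scopes, the pooler, and your task-specific variables freezes everything else. The helper below is hypothetical; you would use its output in place of `tf.trainable_variables()` where `optimization.py` computes gradients.

```python
import tensorflow as tf

def trainable_subset(num_bert_layers_to_tune=3, total_layers=12):
    """Hypothetical helper: keep only the top `num_bert_layers_to_tune`
    encoder layers, the pooler, and any non-BERT (task-specific) variables."""
    tuned_prefixes = {
        "bert/encoder/layer_%d/" % i
        for i in range(total_layers - num_bert_layers_to_tune, total_layers)
    }
    keep = []
    for var in tf.trainable_variables():
        if not var.name.startswith("bert/"):
            keep.append(var)          # e.g. the added classifier / extra layer
        elif var.name.startswith("bert/pooler/"):
            keep.append(var)
        elif any(var.name.startswith(p) for p in tuned_prefixes):
            keep.append(var)
    return keep

# Then, inside optimization.py, compute gradients only for this subset, e.g.
# grads = tf.gradients(loss, trainable_subset())
```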
For data-parallelism, this project may help you. https://github.com/HaoyuHu/bert-multi-gpu
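For reference, stock TensorFlow (1.14+ / 2.x) also offers data parallelism via `tf.distribute.MirroredStrategy`, independent of this repo's TPUEstimator code. A minimal sketch, assuming a hypothetical `build_model()` that returns a `tf.keras.Model`:

```python
import tensorflow as tf

# Sketch only: MirroredStrategy replicates the model on each visible GPU and
# all-reduces gradients, so each GPU sees global_batch / num_gpus examples.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_model()  # hypothetical: returns a tf.keras.Model
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="sparse_categorical_crossentropy")

# model.fit(train_dataset.batch(global_batch_size), epochs=3)
```

Note the caveat above still applies: the per-GPU micro-batch for BERT-Large is tiny, so data parallelism alone may not get you to a useful effective batch size.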
Can the BERT-Base model run on a 12GB GPU without errors like CUDA out of memory?
Yes, of course. I am currently running the BERT-Base model on an 11GB 1080 Ti.
Using Google Colab with a GPU
Of the options above, please tell me which one is best for fine-tuning a BERT model for NLP.
I need to implement a pretrained Bert2Bert EncoderDecoderModel for sentence simplification. I tried fine-tuning the model on Colab, but after 1 epoch I got a CUDA out-of-memory error. My batch size was 4. Could anyone suggest an alternative? I am doing my bachelor-level project and cannot afford a paid GPU service.
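Not part of this repo, but since you mention Bert2Bert: with the Hugging Face transformers library, the usual levers for fitting an EncoderDecoderModel on a free Colab GPU are a small micro-batch with gradient accumulation, fp16, and gradient checkpointing. A hedged sketch (argument names as in recent transformers versions; older ones set gradient checkpointing on the model config instead, and `train_dataset` is your own data):

```python
from transformers import (BertTokenizerFast, EncoderDecoderModel,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-simplification",
    per_device_train_batch_size=2,   # small micro-batch to fit in memory
    gradient_accumulation_steps=8,   # effective batch size of 16
    fp16=True,                       # half precision on the Colab GPU
    gradient_checkpointing=True,     # recompute activations to save memory
    num_train_epochs=3,
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_dataset)  # your dataset here
# trainer.train()
```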
Given the huge number of parameters in BERT, I wonder whether it is at all feasible to fine-tune on GPUs without going to the Google Cloud TPU offerings. Has there been any benchmarking of the current implementation? If so, what types of GPUs are expected to work, and with how many layers and attention heads?
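As a rough back-of-the-envelope check (not an official benchmark): BERT-Large has about 340M parameters (24 layers, 1024 hidden, 16 heads), and Adam keeps two extra slots per parameter, so in fp32 the weights, gradients, and optimizer state alone take a large slice of a 12GB card before any activations are allocated.

```python
# Rough estimate: fp32 training with Adam (2 moment slots per parameter).
params = 340e6                  # BERT-Large parameter count (approx.)
bytes_per_float = 4
weights = params * bytes_per_float
grads = params * bytes_per_float
adam_slots = 2 * params * bytes_per_float

static_gb = (weights + grads + adam_slots) / 1e9
print("Static memory before activations: ~%.1f GB" % static_gb)  # ~5.4 GB
# Activations for long sequences take most of the remainder, which is why
# the max batch size on a 12GB GPU is only a couple of sequences.
```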