Running BERT-Base is generally fine, but unfortunately running BERT-Large fine-tuning on a 12GB GPU is mostly impossible for now. See the "Out-of-memory" section of the README for more information. I plan to implement workarounds that will make tuning BERT-Large on the GPU work out of the box, but I haven't gotten a chance to work on this yet because I'm focused on the multilingual models. For now the workarounds are:
(a) Use BERT-Base now, with plans to switch to BERT-Large later.
(b) Fork the repo and get OpenAI's gradient checkpointing working (and please report back if it works, I haven't tried it yet! A rough sketch follows below.)
(c) Use a Cloud TPU, either paid or for free with Colab (see "Using BERT in Colab" in the README).
(d) Use the feature-based method with BERT-Large (see "Using BERT to extract fixed feature vectors" in the README).
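For option (b), here is a minimal, untested sketch of how OpenAI's gradient checkpointing might be wired in, assuming the `memory_saving_gradients` module from github.com/openai/gradient-checkpointing is on your `PYTHONPATH`. It is not part of this repo; the only idea is that BERT's `optimization.create_optimizer` calls `tf.gradients`, so monkey-patching that function before the graph is built makes activations get recomputed instead of stored.

```python
# Hypothetical sketch of workaround (b): patch tf.gradients with OpenAI's
# gradient checkpointing BEFORE building the training graph. Untested here.
import tensorflow as tf
import memory_saving_gradients  # from github.com/openai/gradient-checkpointing

# optimization.create_optimizer calls tf.gradients(loss, tvars); with this
# patch it will recompute activations during the backward pass, trading
# extra compute for a smaller peak memory footprint.
tf.__dict__["gradients"] = memory_saving_gradients.gradients_memory

# ...then build the estimator / model_fn and run fine-tuning as usual.
```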
Thanks so much for the reply and the timely code/model release. All the suggestions here are quite interesting. Would you also consider multi-GPU training as a workaround for the RAM limit, given that multi-GPU computing resources are more ubiquitously available outside Google?
Multi-GPU training would require using a different (more complex) TensorFlow interface, and we want to stick with TPUEstimator, so if someone wants to implement that they will probably have to fork the repo.
However, the problem is that for reasonably long sequence lengths the max batch size of BERT-Large on a 12GB GPU is 0 to 2 sequences, so even on 4 GPUs you'd still be looking at a batch size of 0 to 8, and batch sizes below 16 will generally degrade results (no matter the learning rate or number of epochs).
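One possible way around the small per-step batch, not implemented in this repo, is gradient accumulation: sum gradients over several micro-batches of 2 sequences and apply a single update, so the effective batch size can still reach 16. A TF1-style sketch with hypothetical names follows; plain `AdamOptimizer` stands in for the repo's `AdamWeightDecayOptimizer`.

```python
import tensorflow as tf

def build_accumulated_train_op(loss, learning_rate, accum_steps=8):
    """Sketch: average gradients over `accum_steps` micro-batches before
    applying one update (e.g. 8 micro-batches of 2 -> effective batch 16)."""
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    pairs = [(g, v) for g, v in zip(grads, tvars) if g is not None]

    # One non-trainable buffer per variable to hold the running gradient sum.
    accum = [tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
             for _, v in pairs]

    # Run every micro-batch: add the new gradients to the buffers.
    accumulate_op = tf.group(
        *[a.assign_add(g) for a, (g, _) in zip(accum, pairs)])

    # Run once every `accum_steps` micro-batches: apply the averaged
    # gradients, then clear the buffers.
    optimizer = tf.train.AdamOptimizer(learning_rate)
    apply_op = optimizer.apply_gradients(
        [(a / accum_steps, v) for a, (_, v) in zip(accum, pairs)])
    with tf.control_dependencies([apply_op]):
        reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum])
    return accumulate_op, reset_op
```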
I agree that, with the huge BERT-Large model, the "data-parallelism" approach isn't the most effective way to scale it across multiple devices. My intuition points towards the "model-parallelism" approach, which might eventually be useful for TPU-based training too (e.g. L>>24, A>>16, H>>1024). But I understand that this might be outside the scope of this repo. Thanks for the discussion anyway.
Yeah, if we scale up beyond BERT-Large, we are probably going to take a fundamentally different modeling approach (i.e., not just train a 64-layer, 2048-dim Transformer), so it will probably be a different repository and a different project name.
Is it possible to fine-tune just a specific number of layers, say the last 2 layers?
@MHDBST I think that is essentially what you do with BERT; they give you a model with pre-trained weights, and you then add the final layers, e.g. a classification layer, to suit your own task, which is why training is much faster.
@BigBadBurrow Thanks. But I meant fine-tuning the top 3 layers of BERT, and then having another non-linear layer on top of that.
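One way to approximate this with the TF code in this repo (a sketch, not an option the scripts expose) is to restrict which variables receive gradients. BERT's encoder variables are scoped as `bert/encoder/layer_N/...`, so keeping only the top few layer scopes, the pooler, and your task-specific variables freezes everything else. The helper below is hypothetical; you would use its output in place of `tf.trainable_variables()` where `optimization.py` computes gradients.

```python
import tensorflow as tf

def trainable_subset(num_bert_layers_to_tune=3, total_layers=12):
    """Hypothetical helper: keep only the top `num_bert_layers_to_tune`
    encoder layers, the pooler, and any non-BERT (task-specific) variables."""
    tuned_prefixes = {
        "bert/encoder/layer_%d/" % i
        for i in range(total_layers - num_bert_layers_to_tune, total_layers)
    }
    keep = []
    for var in tf.trainable_variables():
        if not var.name.startswith("bert/"):
            keep.append(var)          # e.g. the added classifier / extra layer
        elif var.name.startswith("bert/pooler/"):
            keep.append(var)
        elif any(var.name.startswith(p) for p in tuned_prefixes):
            keep.append(var)
    return keep

# Then, inside optimization.py, compute gradients only for this subset, e.g.
# grads = tf.gradients(loss, trainable_subset())
```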
For data-parallelism, this project may help you. https://github.com/HaoyuHu/bert-multi-gpu
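For reference, stock TensorFlow (1.14+ / 2.x) also offers data parallelism via `tf.distribute.MirroredStrategy`, independent of this repo's TPUEstimator code. A minimal sketch, assuming a hypothetical `build_model()` that returns a `tf.keras.Model`:

```python
import tensorflow as tf

# Sketch only: MirroredStrategy replicates the model on each visible GPU and
# all-reduces gradients, so each GPU sees global_batch / num_gpus examples.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_model()  # hypothetical: returns a tf.keras.Model
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="sparse_categorical_crossentropy")

# model.fit(train_dataset.batch(global_batch_size), epochs=3)
```

Note the caveat above still applies: the per-GPU micro-batch for BERT-Large is tiny, so data parallelism alone may not get you to a useful effective batch size.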
Can the BERT-Base model run on a 12GB GPU without errors like CUDA out of memory?
Yes, of course. I am currently running the BERT-Base model on an 11GB 1080 Ti.
Using Google Colab with a GPU
Of the options above, please tell me which one is best for fine-tuning a BERT model for NLP.
I need to implement a pretrained Bert2Bert EncoderDecoderModel for sentence simplification. I tried fine-tuning the model on Colab, but after 1 epoch I got a CUDA out-of-memory error. My batch size was 4. Could anyone suggest an alternative? I am doing my bachelor-level project and cannot afford a paid GPU service.
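Not part of this repo, but since you mention Bert2Bert: with the Hugging Face transformers library, the usual levers for fitting an EncoderDecoderModel on a free Colab GPU are a small micro-batch with gradient accumulation, fp16, and gradient checkpointing. A hedged sketch (argument names as in recent transformers versions; older ones set gradient checkpointing on the model config instead, and `train_dataset` is your own data):

```python
from transformers import (BertTokenizerFast, EncoderDecoderModel,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-simplification",
    per_device_train_batch_size=2,   # small micro-batch to fit in memory
    gradient_accumulation_steps=8,   # effective batch size of 16
    fp16=True,                       # half precision on the Colab GPU
    gradient_checkpointing=True,     # recompute activations to save memory
    num_train_epochs=3,
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_dataset)  # your dataset here
# trainer.train()
```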
Given the huge number of parameters in BERT, I wonder whether it is at all feasible to fine-tune on GPUs without going to the Google Cloud TPU offerings. Has there been any benchmarking of the current implementation? If so, what types of GPUs are expected to work, and with how many layers and attention heads?
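As a rough back-of-the-envelope check (not an official benchmark): BERT-Large has about 340M parameters (24 layers, 1024 hidden, 16 heads), and Adam keeps two extra slots per parameter, so in fp32 the weights, gradients, and optimizer state alone take a large slice of a 12GB card before any activations are allocated.

```python
# Rough estimate: fp32 training with Adam (2 moment slots per parameter).
params = 340e6                  # BERT-Large parameter count (approx.)
bytes_per_float = 4
weights = params * bytes_per_float
grads = params * bytes_per_float
adam_slots = 2 * params * bytes_per_float

static_gb = (weights + grads + adam_slots) / 1e9
print("Static memory before activations: ~%.1f GB" % static_gb)  # ~5.4 GB
# Activations for long sequences take most of the remainder, which is why
# the max batch size on a 12GB GPU is only a couple of sequences.
```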