huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Feature Request: bfloat16 training #243

Closed ikergarcia1996 closed 2 years ago

ikergarcia1996 commented 2 years ago

Hi! Thanks for the accelerate project, it is helping me a lot in my research :)

I am working with T5 models. Training T5 models with FP16 is not possible without tricks that may slow down training (I ran into this issue with mT5 and T5-large): the loss will overflow or underflow, leading to NaN loss (https://github.com/huggingface/transformers/pull/10956). The best solution proposed to date is training with BF16 instead of FP16. However, I cannot find any option to use BF16 instead of FP16 in accelerate.

The adoption of the bf16 format is spreading and more and more models are being trained in this format. In addition, hardware that supports bf16 is also becoming more widespread. Would it be possible to add support for bf16 in accelerate? Or is there a way to use BF16 that I haven't figured out? I can't find anything in the documentation.

sgugger commented 2 years ago

Hi there! You're right, it's not added yet. It came up pretty recently (and you will need a very recent GPU). Or were you talking about bfloat16 support for TPU training?
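(For reference, a minimal way to check whether the current GPU supports bf16, assuming a PyTorch release recent enough to provide torch.cuda.is_bf16_supported():)

import torch

# Ampere-class GPUs on a recent PyTorch build report True here.
print(torch.cuda.is_available() and torch.cuda.is_bf16_supported())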

ikergarcia1996 commented 2 years ago

I am talking about running bfloat16 on a GPU. I am currently experimenting with an Ampere GPU that supports this feature. Taking a look at the AMP documentation in PyTorch, using bf16 should be as simple as adding an additional parameter to autocast:

import torch
from torch.cuda.amp import autocast  # the dtype argument requires a recent PyTorch release
with autocast(dtype=torch.bfloat16):
    ...  # forward pass runs under bf16 autocast

Would it be possible to add an option to the configuration to specify the data type of the autocast function? I could help with a pull request if that's OK.

sgugger commented 2 years ago

You just have to be a bit careful since that argument was only introduced in recent versions of PyTorch, and Accelerate supports version 1.4.0 and above, but I think that's the gist of it, yes. I'm happy to review a PR if you want to dive into this :-)
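(A rough sketch of the kind of guard such a PR would need; the exact release in which autocast gained the dtype argument is an assumption here, not something stated in the thread:)

import torch
from packaging import version

# Hypothetical guard: only request a bf16 autocast on PyTorch releases whose
# autocast accepts a dtype argument (roughly 1.10 and later).
if version.parse(torch.__version__) >= version.parse("1.10"):
    autocast_context = torch.cuda.amp.autocast(dtype=torch.bfloat16)
else:
    raise ValueError("bf16 mixed precision requires a more recent PyTorch release")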

neel04 commented 2 years ago

... bfloat16 support for TPU training?

Just to confirm - TPU training uses bfloat16, right? And does it work in multi-TPU settings (as provided by TRC)?

sgugger commented 2 years ago

They use bfloat16 internally, but convert the outputs of the layers back to FP32. There is an environment variable you can set to have the training in full bfloat16.
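(This presumably refers to torch_xla's XLA_USE_BF16 flag; treating that as an assumption, enabling full-bf16 TPU training would look like:)

import os

# Must be set before torch_xla initializes; float32 tensors are then handled as bfloat16 on the TPU.
os.environ["XLA_USE_BF16"] = "1"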

Closing the issue since the PR adding support on GPU has been merged :-)

cyk1337 commented 2 years ago

Hi @ikergarcia1996, I also ran into the same NaN problem when training mT5 with fp16. How did you solve it?

ikergarcia1996 commented 2 years ago

@cyk1337 You must train the model using BF16 instead of FP16 (Accelerate already supports bf16). If your GPU doesn't support BF16, you should train the model using FP32. I don't know if anybody has found a way to successfully finetune T5 in FP16.
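(For anyone finding this later, a minimal sketch of enabling bf16 through the Accelerator API; the mixed_precision argument reflects recent Accelerate releases and may be named differently in older ones:)

from accelerate import Accelerator

# model, optimizer and dataloader are assumed to be defined elsewhere.
accelerator = Accelerator(mixed_precision="bf16")  # instead of "fp16"
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)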

lixin4ever commented 1 year ago

@ikergarcia1996 BF16 indeed works and no NaN problem occurs. Just wondering, did you observe any performance drop after using BF16? In our experiments, we find that mT5 trained with BF16 performs noticeably worse than mT5 trained with FP32.