microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Concern around mixed precision training where weights are in low precision #5307

Open ethansmith2000 opened 6 months ago

ethansmith2000 commented 6 months ago

I noticed that in DeepSpeed, when training with fp16 or bf16, the weights themselves are stored in the lower precision. I am wondering if there is any chance of making this optional. For both bf16 and fp16 there is a risk of the weight update "disappearing" entirely, because the change is smaller than the precision the weight can represent.
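To make the concern concrete, here is a minimal sketch (plain PyTorch, not DeepSpeed code) showing that an update smaller than the representable step around a weight's value is lost when the weight itself is bf16, while an fp32 master copy keeps it:

```python
import torch

# A weight near 1.0 and a small optimizer step (e.g. lr * grad).
w_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
w_fp32 = torch.tensor(1.0, dtype=torch.float32)
update = 1e-3

# bf16 spacing around 1.0 is ~7.8e-3, so the step rounds away to nothing.
print((w_bf16 + update) == w_bf16)  # True: the update disappears
# An fp32 master weight accumulates the same step without trouble.
print((w_fp32 + update) == w_fp32)  # False: the update is retained
```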

This paper first brought the issue to my attention: https://arxiv.org/abs/2010.06192

Empirically, I have found a lot of diffusion model training to have small gradient norms, often around 0.02 or so. In bf16, and possibly even fp16, it appears that such an optimization step may not even register.

(screenshot attached)

In fp16, more bits are allocated to the mantissa, so it's less risky, but it still seems like a potential issue.
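For reference (a quick check in plain PyTorch, not tied to DeepSpeed), the machine epsilon of the two formats shows how much finer fp16's mantissa is:

```python
import torch

# Relative spacing between adjacent representable values near 1.0.
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125   (7 mantissa bits)
print(torch.finfo(torch.float16).eps)   # 0.00097656  (10 mantissa bits)
```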

Fetching the dtype of the optimizer states and model weights does show they are in the reduced precision, but to make sure I also checked the GPU memory usage. The screenshot below is from ZeRO stage-1 training of SDXL.

(screenshot attached: GPU memory usage during ZeRO stage-1 SDXL training)
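Roughly the kind of check I mean (a sketch only, assuming `engine` is the model engine returned by `deepspeed.initialize`; the exact optimizer-state layout can differ across ZeRO stages and versions):

```python
import torch

# Inspect parameter dtypes on the wrapped module and overall GPU memory.
for name, p in engine.module.named_parameters():
    print(name, p.dtype)  # bf16/fp16 here means the weights live in low precision

print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
```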

Additionally, I had previously mentioned here that the DeepSpeed BERT training example suffers a significant performance loss when running in bf16.

(screenshot attached: BERT bf16 training results)
ethansmith2000 commented 4 months ago

Wanted to link this one here too: https://github.com/Lightning-AI/pytorch-lightning/issues/18016

SonicCodes commented 2 weeks ago

Found the solution, sire?