Thank you for catching my mistake, Valentin - I really appreciate you noticing the reversal!
Fixed here: https://github.com/stas00/ml-engineering/pull/44
Oh, "bf16 regime" just means that you do the math in bf16 - either through AMP, or without AMP with the model weights themselves in bf16.
At least in an LM the inputs are token IDs, not floats, but the inputs to all subsequent layers' forward passes are floats.
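For example, here is a minimal PyTorch sketch of the two flavors (the tiny `torch.nn.Linear` stand-in and the shapes are purely illustrative, not anything from the book):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for a real model
x = torch.randn(4, 16)           # float inputs to the forward pass

# Flavor 1: AMP - the weights stay fp32, autocast runs eligible ops in bf16
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_amp = model(x)

# Flavor 2: no AMP - the weights themselves are cast to bf16
model_bf16 = model.to(torch.bfloat16)
y_bf16 = model_bf16(x.to(torch.bfloat16))
```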
If you feel that more commentary is needed in that section, please kindly suggest what to add.
In the "Changing precision post-training" section it is stated that:
When reading this statement, I considered the following scenario:
I'm quite surprised and would have expected the opposite statement: converting weights from fp16 $[-65504; 65504]$ to bf16 $\approx [-3.4 \times 10^{38}; 3.4 \times 10^{38}]$ wouldn't result in an overflow, whereas converting weights from bf16 to fp16 could result in an under/overflow.
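For reference, a quick PyTorch check of those ranges and of the overflow direction (the specific test values are just illustrative):

```python
import torch

# The representable ranges of the two dtypes
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38 (same exponent range as fp32)

# bf16 -> fp16: a value bf16 holds comfortably overflows fp16
big = torch.tensor([1e5], dtype=torch.bfloat16)  # stored as 99840 after bf16 rounding
print(big.to(torch.float16))  # tensor([inf], dtype=torch.float16) - overflow

# fp16 -> bf16: never overflows, only loses mantissa precision
small = torch.tensor([65504.0], dtype=torch.float16)  # fp16's largest normal value
print(small.to(torch.bfloat16))  # rounds to 65536 - in range, just less precise
```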
Is there something I'm overlooking or misunderstanding? Or does the term "in bf16 regime" actually imply that the model receives bf16 inputs?