Thank you for catching my mistake, Valentin - I really appreciate you noticing the reversal!
Fixed here: https://github.com/stas00/ml-engineering/pull/44
Oh, "bf16 regime" just means that you do the math in bf16 - either through AMP, or without AMP with the model weights themselves in bf16.
At least in an LM the inputs are token IDs, not floats, but the inputs to all subsequent layers' forward passes are floats.
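For example, here is a minimal PyTorch sketch of the two flavors (the tiny `torch.nn.Linear` stand-in and the shapes are purely illustrative, not anything from the book):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for a real model
x = torch.randn(4, 16)           # float inputs to the forward pass

# Flavor 1: AMP - the weights stay fp32, autocast runs eligible ops in bf16
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_amp = model(x)

# Flavor 2: no AMP - the weights themselves are cast to bf16
model_bf16 = model.to(torch.bfloat16)
y_bf16 = model_bf16(x.to(torch.bfloat16))
```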
If you feel that more commentary is needed in that section, please kindly suggest what to add.
In the "Changing precision post-training" section it is stated that:
When reading this statement, I considered the following scenario:
I'm quite surprised and would have expected the opposite statement: converting weights from fp16 $[-65504; 65504]$ to bf16 $\approx [-3.4 \times 10^{38}; 3.4 \times 10^{38}]$ wouldn't result in an overflow, whereas converting weights from bf16 to fp16 could result in an under/overflow.
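For reference, a quick PyTorch check of those ranges and of the overflow direction (the specific test values are just illustrative):

```python
import torch

# The representable ranges of the two dtypes
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38 (same exponent range as fp32)

# bf16 -> fp16: a value bf16 holds comfortably overflows fp16
big = torch.tensor([1e5], dtype=torch.bfloat16)  # stored as 99840 after bf16 rounding
print(big.to(torch.float16))  # tensor([inf], dtype=torch.float16) - overflow

# fp16 -> bf16: never overflows, only loses mantissa precision
small = torch.tensor([65504.0], dtype=torch.float16)  # fp16's largest normal value
print(small.to(torch.bfloat16))  # rounds to 65536 - in range, just less precise
```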
Is there something I'm overlooking or misunderstanding? Or does the term "in bf16 regime" actually imply that the model receives bf16 inputs?