unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

25% less mem and 10% faster training: Do not upcast lm_head and embedding to float32 #1186

Closed · Datta0 closed 4 weeks ago

Datta0 commented 4 weeks ago

There is not much of a difference in training loss from upcasting lm_head and embed_tokens to FP32 while training/finetuning. Below are the results. Note that I had to shift the right axis up a little to show the difference, which was on average 8×10⁻⁴ and can plausibly be attributed to accumulated floating-point imprecision.
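For context, here is a minimal sketch (not Unsloth's actual internals) of what the change amounts to for a Hugging Face causal LM: the token embeddings and LM head are simply left in bfloat16 instead of being cast to float32 before training. The model id is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the base model in bfloat16 (placeholder model id).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
)

# Previous behaviour (upcast): roughly doubles the memory held by these
# two large matrices.
# model.get_input_embeddings().to(torch.float32)
# model.get_output_embeddings().to(torch.float32)

# Proposed behaviour: leave both in bfloat16.
print(model.get_input_embeddings().weight.dtype)   # torch.bfloat16
print(model.get_output_embeddings().weight.dtype)  # torch.bfloat16
```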

Coming to the memory stats (a sketch of how such numbers can be collected follows the two runs below):

bfloat16 - continued pretraining
GPU = NVIDIA A100-PCIE-40GB. Max memory = 39.496 GB.
11.398 GB of memory reserved. 
Goes up to 29.2 GiB as observed on nvtop/torch.cuda.mem_get_info()
For the IFT (instruction finetuning) phase, it goes up to 25.5 GiB
675.8237 seconds used for training.
11.26 minutes used for training.
Peak reserved memory = 28.701 GB.
Peak reserved memory for training = 17.303 GB.
Peak reserved memory % of max memory = 72.668 %.
Peak reserved memory for training % of max memory = 43.809 %.
float32
GPU = NVIDIA A100-PCIE-40GB. Max memory = 39.496 GB.
16.535 GB of memory reserved.
Goes up to 37.5 GiB as observed on nvtop/torch.cuda.mem_get_info()
737.8355 seconds used for training.
12.3 minutes used for training.
Peak reserved memory = 37.16 GB.
Peak reserved memory for training = 20.625 GB.
Peak reserved memory % of max memory = 94.085 %.
Peak reserved memory for training % of max memory = 52.22 %.
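The readouts above follow the usual notebook-style memory report. A rough sketch, using only standard torch.cuda calls (the exact bookkeeping in the Unsloth notebooks may differ), of how such stats can be reproduced:

```python
import torch

gpu = torch.cuda.get_device_properties(0)
max_memory = round(gpu.total_memory / 1024**3, 3)

# Reserved memory before training (compare with "GB of memory reserved" above).
start_reserved = round(torch.cuda.max_memory_reserved() / 1024**3, 3)

# ... run trainer.train() here ...

# Peak reserved memory after training, and the training-only share.
peak_reserved = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
training_reserved = round(peak_reserved - start_reserved, 3)
print(f"GPU = {gpu.name}. Max memory = {max_memory} GB.")
print(f"Peak reserved memory = {peak_reserved} GB.")
print(f"Peak reserved memory for training = {training_reserved} GB.")
print(f"Peak reserved memory % of max memory = {round(peak_reserved / max_memory * 100, 3)} %.")

# Instantaneous free/total device memory, as cross-checked against nvtop.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Currently used: {(total_bytes - free_bytes) / 1024**3:.1f} GiB")
```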

[Images: training loss comparison plots]