unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

25% less mem and 10% faster training: Do not upcast lm_head and embedding to float32 #1186

Closed · Datta0 closed 4 weeks ago

Datta0 commented 4 weeks ago

There is not much of a difference in training loss from upcasting lm_head and embed_tokens to FP32 while training/finetuning. Below are the results. Note that I had to shift the right axis up a little to show the difference, which was on average 8×10⁻⁴ and can plausibly be attributed to accumulated floating-point imprecision.
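For context, here is a minimal sketch (not Unsloth's actual internals) of what the change amounts to for a Hugging Face causal LM: the token embeddings and LM head are simply left in bfloat16 instead of being cast to float32 before training. The model id is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the base model in bfloat16 (placeholder model id).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
)

# Previous behaviour (upcast): roughly doubles the memory held by these
# two large matrices.
# model.get_input_embeddings().to(torch.float32)
# model.get_output_embeddings().to(torch.float32)

# Proposed behaviour: leave both in bfloat16.
print(model.get_input_embeddings().weight.dtype)   # torch.bfloat16
print(model.get_output_embeddings().weight.dtype)  # torch.bfloat16
```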

Coming to the memory stats (a sketch of how such numbers can be collected follows the two runs below):

bfloat16 - continued pretraining
GPU = NVIDIA A100-PCIE-40GB. Max memory = 39.496 GB.
11.398 GB of memory reserved. 
Goes up to 29.2 GiB as observed on nvtop/torch.cuda.mem_get_info()
For the IFT (instruction finetuning) phase, it goes up to 25.5 GiB
675.8237 seconds used for training.
11.26 minutes used for training.
Peak reserved memory = 28.701 GB.
Peak reserved memory for training = 17.303 GB.
Peak reserved memory % of max memory = 72.668 %.
Peak reserved memory for training % of max memory = 43.809 %.
float32
GPU = NVIDIA A100-PCIE-40GB. Max memory = 39.496 GB.
16.535 GB of memory reserved.
Goes up to 37.5 GiB as observed on nvtop/torch.cuda.mem_get_info()
737.8355 seconds used for training.
12.3 minutes used for training.
Peak reserved memory = 37.16 GB.
Peak reserved memory for training = 20.625 GB.
Peak reserved memory % of max memory = 94.085 %.
Peak reserved memory for training % of max memory = 52.22 %.
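The readouts above follow the usual notebook-style memory report. A rough sketch, using only standard torch.cuda calls (the exact bookkeeping in the Unsloth notebooks may differ), of how such stats can be reproduced:

```python
import torch

gpu = torch.cuda.get_device_properties(0)
max_memory = round(gpu.total_memory / 1024**3, 3)

# Reserved memory before training (compare with "GB of memory reserved" above).
start_reserved = round(torch.cuda.max_memory_reserved() / 1024**3, 3)

# ... run trainer.train() here ...

# Peak reserved memory after training, and the training-only share.
peak_reserved = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
training_reserved = round(peak_reserved - start_reserved, 3)
print(f"GPU = {gpu.name}. Max memory = {max_memory} GB.")
print(f"Peak reserved memory = {peak_reserved} GB.")
print(f"Peak reserved memory for training = {training_reserved} GB.")
print(f"Peak reserved memory % of max memory = {round(peak_reserved / max_memory * 100, 3)} %.")

# Instantaneous free/total device memory, as cross-checked against nvtop.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Currently used: {(total_bytes - free_bytes) / 1024**3:.1f} GiB")
```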

[Images: training loss comparison plots]