unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Why does training lm_head and embed_tokens require converting precision to fp32? #633

Open letterk opened 3 weeks ago

letterk commented 3 weeks ago

I cannot train Qwen2 7B on a 4090 GPU: loading the embedding layer (upcast to fp32) causes out-of-memory (OOM) errors. The run is expected to need over 27GB of VRAM, which exceeds the 24GB on the card, whereas plain QLoRA trains comfortably in under 12GB.
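
For reference, a rough back-of-the-envelope estimate (assuming Qwen2 7B's published config: vocab_size = 152064, hidden_size = 3584, untied embeddings) shows why upcasting both large matrices to fp32 with Adam pushes past 24GB:

```python
# Rough VRAM estimate for making embed_tokens and lm_head trainable in fp32.
# The numbers are assumptions taken from Qwen2-7B's config, not measured values.
vocab_size, hidden_size = 152_064, 3_584
params_per_matrix = vocab_size * hidden_size          # ~545M parameters each

bytes_fp32 = 4
# weight + gradient + Adam exp_avg + Adam exp_avg_sq, all kept in fp32
states_per_param = 4
per_matrix_gb = params_per_matrix * bytes_fp32 * states_per_param / 1024**3

base_model_gb = 7e9 * 0.5 / 1024**3                   # ~7B params stored in 4-bit (0.5 byte each)

total_gb = 2 * per_matrix_gb + base_model_gb          # embed_tokens + lm_head + 4-bit base
print(f"per matrix: {per_matrix_gb:.1f} GB, total before activations: {total_gb:.1f} GB")
# -> roughly 8.1 GB per matrix, ~19.5 GB before activations, LoRA adapters and CUDA
#    overhead, so a 24GB 4090 runs out of memory.
```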

danielhanchen commented 3 weeks ago

Unfortunately, float32 is required for those layers. In theory bfloat16 could be used, but the gradients would not be correct under mixed-precision training.
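
For context, a minimal sketch (plain PyTorch/Transformers, not Unsloth's internal code) of what upcasting those two modules to fp32 looks like; `model` is assumed to be an already-loaded causal LM:

```python
import torch

# Hedged sketch: keep the bulk of the model in bf16 / 4-bit, but hold the
# trainable embedding and output head in float32 so their gradients and
# optimizer states accumulate in full precision.
model.get_input_embeddings().to(torch.float32)   # embed_tokens
model.get_output_embeddings().to(torch.float32)  # lm_head
```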

danielhanchen commented 3 weeks ago

I would unset them. Another approach is to train only lm_head and not embed_tokens, which saves more memory.
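
A sketch of that second approach, based on the usual Unsloth continued-pretraining setup (the hyperparameter values here are placeholders): "lm_head" is kept in target_modules while "embed_tokens" is left out, so only one large matrix becomes trainable.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2-7B",
    max_seq_length = 2048,
    load_in_4bit = True,   # QLoRA base in 4-bit
)

# Train lm_head but NOT embed_tokens, so only one large matrix is upcast to fp32.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "lm_head"],          # "embed_tokens" intentionally omitted
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
```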