unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Remove "embed_tokens" and "lm_head" Lora layers when loading CPT trained models #1227

Open daegonYu opened 6 hours ago

daegonYu commented 6 hours ago

When I load a model that was trained with CPT on the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"] layers and set it up for fine-tuning, the "embed_tokens" and "lm_head" LoRA layers are removed. Is this intentional?

In other words, is it intended that CPT trains the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"] layers, while fine-tuning trains only ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]?

In the provided Colab notebook https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing#scrollTo=2ejIt2xSNKKp, the "embed_tokens" and "lm_head" layers are also trained during instruction fine-tuning (IFT).
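
For reference, here is a minimal sketch (not the exact notebook code) of how a CPT-style LoRA setup in Unsloth includes "embed_tokens" and "lm_head" in the target modules; the model name and hyperparameters below are illustrative assumptions, not values from the notebook:

```python
from unsloth import FastLanguageModel

# Load a base model (model name here is only an example)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach LoRA adapters; embed_tokens and lm_head are included for CPT
# so the model can adapt its vocabulary/embedding space to the new domain.
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
    lora_alpha = 32,
    use_gradient_checkpointing = "unsloth",
)
```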

Erland366 commented 5 hours ago

I think so, yes. In CPT you want the model to relearn the words and vocabulary, while in fine-tuning you only want the model to follow patterns, given that it already understands the words and vocabulary (i.e., it already has the knowledge).

Technically, nothing stops you from training lm_head and embed_tokens again. But Unsloth doesn't apply LoRA to those layers by default (I think, CMIIW), and training them requires a lot more memory.
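
If you want to check which modules actually carry LoRA adapters after loading your CPT checkpoint for fine-tuning, a small sketch like the one below can help. It only uses generic PyTorch parameter-name introspection; the checkpoint path is a placeholder:

```python
from unsloth import FastLanguageModel

# Path is hypothetical -- point this at your saved CPT LoRA checkpoint
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "path/to/cpt_lora_checkpoint",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Collect the parent module names of every LoRA parameter
# (covers both lora_A/lora_B and lora_embedding_A/lora_embedding_B).
lora_modules = sorted({
    name.split(".lora_")[0].split(".")[-1]
    for name, _ in model.named_parameters()
    if "lora_" in name
})
print(lora_modules)
# If "embed_tokens" / "lm_head" are missing here, their adapters
# were not attached when the model was loaded for fine-tuning.
```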