unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai

Remove "embed_tokens" and "lm_head" Lora layers when loading CPT trained models #1227

Closed daegonYu closed 2 weeks ago

daegonYu commented 3 weeks ago

When I load a model trained with CPT (continued pretraining) on the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"] layers for fine-tuning, the "embed_tokens" and "lm_head" LoRA layers are removed. Is this intentional?

In other words, is it intended that CPT trains the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"] layers while fine-tuning trains only the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] layers?

In the provided Colab notebook (https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing#scrollTo=2ejIt2xSNKKp), the "embed_tokens" and "lm_head" layers are also trained during IFT (instruction fine-tuning).
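
For reference, here is a rough sketch of the two configurations I mean, based on that notebook. The model name and hyperparameters below are only illustrative, not the exact values from the notebook:

```python
from unsloth import FastLanguageModel

# Load a base model (4-bit quantized); model name is just an example.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# CPT: LoRA also targets the embedding and output layers.
cpt_target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "embed_tokens", "lm_head",
]

# Fine-tuning: the same list without embed_tokens / lm_head.
ift_target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = cpt_target_modules,  # or ift_target_modules for fine-tuning
    lora_alpha = 32,
    use_gradient_checkpointing = "unsloth",
)
```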

Erland366 commented 3 weeks ago

I think so, yeah. In CPT you want the model to learn the words and vocabulary again, whereas in FT you only want the model to follow patterns, given that it already understands the words and vocabulary (i.e., it already has the knowledge).

Technically, nothing stops you from training lm_head and embed_tokens again. But Unsloth doesn't apply LoRA to those layers by default (I think, CMIIW), and training them requires a lot of memory.
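
If you do want to keep training them during fine-tuning, a rough sketch would be to list them explicitly again and give the embeddings a smaller learning rate, similar to the CPT notebook. The values below are illustrative, and `dataset` is assumed to be prepared elsewhere; expect noticeably higher memory use:

```python
from unsloth import FastLanguageModel, UnslothTrainer, UnslothTrainingArguments

# Re-attach LoRA with embed_tokens / lm_head opted back in explicitly.
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",   # opt back in explicitly
    ],
    lora_alpha = 32,
    use_gradient_checkpointing = "unsloth",
)

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,             # assumed to be prepared elsewhere
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,  # lower LR for embed_tokens / lm_head
        max_steps = 100,
        output_dir = "outputs",
    ),
)
```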

daegonYu commented 2 weeks ago

Oh, I see. Thank you!