daegonYu opened this issue 6 hours ago
I think so, yeah. In CPT you want the model to learn the words and the vocabulary again, while in FT you only want the model to follow patterns, given that it already understands the words and the vocabulary (i.e., it already has the knowledge).
Technically, nothing stops you from training the lm_head and embed_tokens again. But Unsloth doesn't apply LoRA to those layers (I think, CMIIW), and training them requires a lot of memory.
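For reference, a rough sketch of what including those layers could look like with Unsloth's `FastLanguageModel.get_peft_model` (the model name, rank, and alpha below are placeholders, not values taken from any official notebook):

```python
from unsloth import FastLanguageModel

# Placeholder base model and hyperparameters; adjust to your setup.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# CPT-style setup: listing embed_tokens and lm_head makes the (large)
# embedding and output matrices trainable, which is why memory use goes up.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",   # drop these two for a plain FT run
    ],
    use_gradient_checkpointing = "unsloth",
)
```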
When I load a model that was trained with CPT on the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"] layers and set it up for fine-tuning, the "embed_tokens" and "lm_head" layers are removed. Is this intentional?
In other words, is it intended that CPT trains the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"] layers while fine-tuning trains only the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] layers?
In the provided notebook (https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing#scrollTo=2ejIt2xSNKKp), the "embed_tokens" and "lm_head" layers are also trained during IFT.
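For what it's worth, a quick way to check which layers actually end up trainable after attaching or loading the adapters (assuming `model` is the PEFT-wrapped model):

```python
# Print every parameter that will receive gradients; embed_tokens / lm_head
# entries only show up here if those layers are really being trained.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, tuple(param.shape))
```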