
Unsloth Phi-3.5 LoRA: 3x the Number of Trainable Parameters with the Same Hyperparameters #1324


KristianMoellmann commented 8 hours ago

Hi! I've observed the following when using Unsloth.

Summary

When fine-tuning Phi-3.5 through Unsloth with LoRA, the number of trainable parameters is roughly 3.4x higher than with the Microsoft Phi-3.5 implementation (59,768,832 vs. 17,825,792), despite identical hyperparameters and target modules.

Details

Microsoft Phi-3.5 Model

Using the following configuration:

from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer

# Shared LoRA hyperparameters, used identically in both setups
lora_alpha = 64
lora_r = 32
lora_dropout = 0
lora_target_modules = "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj"
load_in_4bit = True

# 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,
    bnb_4bit_use_double_quant=True,
)

model_name = "microsoft/Phi-3.5-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=lora_target_modules.split(","),
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True
)

# Dataset and training arguments omitted; we only inspect parameter counts
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    peft_config=peft_config,
)

trainer.model.print_trainable_parameters()

Output:

trainable params: 17,825,792 || all params: 3,838,905,344 || trainable%: 0.4643

Unsloth-Based Setup

Configuration:

from unsloth import FastLanguageModel

model_name_unsloth = "unsloth/Phi-3.5-mini-instruct-bnb-4bit"

# FastLanguageModel also returns a tokenizer; it is discarded here so that
# both setups load their tokenizer the same way via AutoTokenizer
model_unsloth, _ = FastLanguageModel.from_pretrained(
    model_name=model_name_unsloth,
    load_in_4bit=load_in_4bit,
)

tokenizer_unsloth = AutoTokenizer.from_pretrained(
    model_name_unsloth, trust_remote_code=True
)

# Same LoRA hyperparameters and target modules as the Microsoft setup
model_unsloth = FastLanguageModel.get_peft_model(
    model_unsloth,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=lora_target_modules.split(","),
)
# The model is already a PEFT model, so no peft_config is passed to the trainer
peft_config_unsloth = None

trainer_unsloth = SFTTrainer(
    model=model_unsloth,
    tokenizer=tokenizer_unsloth,
    peft_config=peft_config_unsloth,
)

trainer_unsloth.accelerator.print(f"{trainer_unsloth.model}")
trainer_unsloth.model.print_trainable_parameters()

Output:

trainable params: 59,768,832 || all params: 3,880,848,384 || trainable%: 1.5401
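
For reference, a LoRA adapter on a linear layer with input dimension d_in and output dimension d_out adds r * (d_in + d_out) trainable parameters. Here is a quick sanity check against both counts, assuming Phi-3.5-mini's published dimensions (hidden_size=3072, intermediate_size=8192, 32 layers; these come from the model config, not from the scripts above):

# Sanity check (sketch): expected LoRA parameter counts for Phi-3.5-mini.
# Assumed dimensions from the model config: hidden_size=3072,
# intermediate_size=8192, num_hidden_layers=32; LoRA rank r=32 as above.
hidden, intermediate, layers, r = 3072, 8192, 32, 32

def lora_params(d_in, d_out):
    # LoRA adds an A matrix (r x d_in) and a B matrix (d_out x r)
    return r * (d_in + d_out)

attn = lora_params(hidden, hidden)            # q/k/v/o_proj: 196,608 each
mlp_wide = lora_params(hidden, intermediate)  # gate/up_proj: 360,448 each
mlp_down = lora_params(intermediate, hidden)  # down_proj:    360,448

print(layers * (attn + mlp_down))                     # 17,825,792
print(layers * (4 * attn + 2 * mlp_wide + mlp_down))  # 59,768,832

If I have computed this correctly, the Microsoft count corresponds to adapters on only o_proj and down_proj, while the Unsloth count corresponds to adapters on all seven listed projections. The difference between the two "all params" totals (3,880,848,384 - 3,838,905,344 = 41,943,040) also exactly equals the difference in trainable parameters, so the base models themselves appear to be the same size.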

As you can see, there is a huge discrepancy in the number of trainable parameters, even though both runs use the same LoRA rank and target-module list.
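
To pin down which layers actually received adapters in each setup, here is a small inspection sketch (assuming standard PEFT wrapping, where each adapted linear layer gains a lora_A submodule):

# Diagnostic sketch: list the leaf module names that were wrapped with LoRA.
# Assumes standard PEFT behaviour, where adapted layers carry `lora_A`.
def lora_wrapped(model):
    return sorted({name.rsplit(".", 1)[-1]
                   for name, module in model.named_modules()
                   if hasattr(module, "lora_A")})

print(lora_wrapped(trainer.model))          # Microsoft setup
print(lora_wrapped(trainer_unsloth.model))  # Unsloth setup

If the Microsoft setup prints only ['down_proj', 'o_proj'], the other five target names were presumably never matched: the Phi-3 implementation fuses the attention and MLP projections into qkv_proj and gate_up_proj, so names like q_proj or gate_proj may simply not exist in that model.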

Am I doing something wrong, or is this unintended behaviour?

Thank you for your ongoing work!