unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Gemma 2 9b lm_head, emb_tokens probably not loading (but were saved) #918

Open richardxoldman opened 3 months ago

richardxoldman commented 3 months ago

I trained LoRA adapters for the unsloth/gemma-2-9b-it-bnb-4bit model, and I also added lm_head and embed_tokens to the adapter:

model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj", "lm_head", "embed_tokens"],
    lora_alpha = LORA_ALPHA,
    lora_dropout = 0.0, 
    bias = "none",    
    use_gradient_checkpointing = "unsloth", 
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None, 
)

Then, after training, I saved it to my drive with:

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

When I load it using

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = access_token,
)

I get much worse outputs, as if only the LoRA adapters without lm_head/embed_tokens had been loaded. The adapter_model.safetensors file is ~7.3 GB (when I was not saving lm_head/embed_tokens it was around 500 MB), so I think lm_head and embed_tokens were saved, but they are not loading correctly.
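
For reference, the tensors that actually ended up in adapter_model.safetensors can be listed directly (a minimal sketch, assuming the safetensors package and the same save_path as above):

import os
from safetensors import safe_open

# Print every saved tensor that belongs to lm_head or embed_tokens, with its shape.
adapter_file = os.path.join(save_path, "adapter_model.safetensors")
with safe_open(adapter_file, framework="pt") as f:
    for key in f.keys():
        if "lm_head" in key or "embed_tokens" in key:
            print(key, f.get_slice(key).get_shape())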

adapter_config.json content:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "unsloth/gemma-2-9b-it-bnb-4bit",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 16,
  "lora_dropout": 0.0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": [
    "lm_head",
    "embed_tokens"
  ],
  "peft_type": "LORA",
  "r": 32,
  "rank_pattern": {},
  "revision": "unsloth",
  "target_modules": [
    "gate_proj",
    "v_proj",
    "down_proj",
    "o_proj",
    "up_proj",
    "q_proj",
    "k_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}

UPDATE:

Actually, I checked and the adapter did load successfully:

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): ModulesToSaveWrapper(
          (original_module): Embedding(256000, 3584)
          (modules_to_save): ModuleDict(
            (default): Embedding(256000, 3584)
          )
        )
 ...
      )
      (lm_head): ModulesToSaveWrapper(
        (original_module): Linear(in_features=3584, out_features=256000, bias=False)
        (modules_to_save): ModuleDict(
          (default): Linear(in_features=3584, out_features=256000, bias=False)
        )
      )

But the inference results after loading the model are still very different from what I got before saving it.
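
As a sanity check (a sketch based on the module layout printed above; the exact attribute paths may vary across peft versions), the loaded modules_to_save copies can be compared against the frozen base weights:

import torch

def max_abs_diff(a, b, chunk=4096):
    # Compare two large weight matrices chunk by chunk to keep memory usage low.
    m = 0.0
    with torch.no_grad():
        for i in range(0, a.shape[0], chunk):
            d = (a[i:i + chunk].float() - b[i:i + chunk].float()).abs().max().item()
            m = max(m, d)
    return m

# Module paths follow the PeftModelForCausalLM printout above.
emb  = model.base_model.model.model.embed_tokens
head = model.base_model.model.lm_head

print("embed_tokens:", max_abs_diff(emb.modules_to_save["default"].weight, emb.original_module.weight))
print("lm_head:     ", max_abs_diff(head.modules_to_save["default"].weight, head.original_module.weight))
# 0.0 would mean the "loaded" copies are identical to the base weights,
# i.e. the finetuned lm_head/embed_tokens were not actually restored.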

richardxoldman commented 3 months ago

Ok so I was able to solve the problem.

In the Gemma 2 9b finetuning notebook there is a cell:

Now if you want to load the LoRA adapters we just saved for inference, set False to True:

if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

I followed this cell (removing the "if False"), but to make the model work correctly it is also necessary to call get_peft_model before inference (which was not stated in the Colab notebook):

model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj", "lm_head", "embed_tokens"], 
    lora_alpha = LORA_ALPHA,
    lora_dropout = 0.0, 
    bias = "none",    
    use_gradient_checkpointing = "unsloth", 
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None,
)

If the only difference is the dtype (before calling get_peft_model, lm_head/embed_tokens are bfloat16; after calling it they are float32), then for my task the dtype makes a pretty huge difference.
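
For what it's worth, the dtype difference can be inspected directly by printing the dtypes of the lm_head/embed_tokens parameters before and after the extra get_peft_model call (a minimal sketch; show_dtypes is just an ad-hoc helper, not an unsloth API):

# Ad-hoc helper: print the dtype of every lm_head / embed_tokens parameter.
def show_dtypes(m, label):
    for name, p in m.named_parameters():
        if "lm_head" in name or "embed_tokens" in name:
            print(label, name, p.dtype)

show_dtypes(model, "after from_pretrained:")   # bfloat16 in this case
# ... re-run the FastLanguageModel.get_peft_model(...) cell from above ...
show_dtypes(model, "after get_peft_model:")    # float32 in this case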