unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory

Error when loading pretrained adapters or 16bit lora adapters #154

Open asphytheghoul opened 5 months ago

asphytheghoul commented 5 months ago

Hello, I was fine-tuning a Llama-2 model with Unsloth using a tokenizer of my own. It has an extended vocabulary of around 48,000 tokens in total; the tokenizer is compatible, and I have run checks on my end to confirm that. This is the code I implemented based on the Colab notebook you provide, and I am unable to load my adapters after fine-tuning:

from unsloth import FastLanguageModel
import torch
from transformers import AutoTokenizer
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
]

tokenizer = AutoTokenizer.from_pretrained("MY_TOKENIZER")

model= FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b-bnb-4bit", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)[0]

model.resize_token_embeddings(len(tokenizer))

model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    # eval_dataset = dataset["test"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 10,
        # evaluation_strategy="steps", #### ADDED ####
        learning_rate = 1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer_stats = trainer.train()

But when I load it using:

if True:
    from unsloth import FastLanguageModel
    model= FastLanguageModel.from_pretrained(
        model_name = "./model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )[0]
I get the following error:

==((====))==  Unsloth: Fast Llama patching release 2024.2
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.1.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.22.post7. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
/usr/local/lib/python3.10/dist-packages/transformers/quantizers/auto.py:147: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be prevail.
  warnings.warn(warning_msg)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-34-0d818d623120>](https://localhost:8080/#) in <cell line: 1>()
      1 if True:
      2     from unsloth import FastLanguageModel
----> 3     model= FastLanguageModel.from_pretrained(
      4         model_name = "./translation-en-hin-no-merges", # YOUR MODEL YOU USED FOR TRAINING
      5         max_seq_length = max_seq_length,

4 frames
[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in load_state_dict(self, state_dict, strict, assign)
   2150 
   2151         if len(error_msgs) > 0:
-> 2152             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   2153                                self.__class__.__name__, "\n\t".join(error_msgs)))
   2154         return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([47943, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
    size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([47943, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

Please do help out :)

danielhanchen commented 5 months ago

@asphytheghoul Whoops - that's `llm_int8_skip_modules` - in your config.json, change `llm_int8_skip_modules = "null"` to `llm_int8_skip_modules = null` with no quotation marks (i.e. the JSON value null, not the string "null") - I just fixed it on my side - sorry!
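
(For anyone applying this edit by hand, a minimal sketch in Python - assuming you have a local copy of config.json and that llm_int8_skip_modules sits under its "quantization_config" section:)

import json

config_path = "config.json"  # hypothetical path to your local copy

with open(config_path) as f:
    cfg = json.load(f)

quant_cfg = cfg.get("quantization_config", {})
if quant_cfg.get("llm_int8_skip_modules") == "null":   # string "null" is the bad value
    quant_cfg["llm_int8_skip_modules"] = None           # json.dump writes this back as null
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)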

In terms of extending the tokenizer - you also need to update the lm_head and the embedding matrix, for example with:

from typing import Dict

import transformers

def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings_data = model.get_input_embeddings().weight.data
        output_embeddings_data = model.get_output_embeddings().weight.data

        # Initialise the new rows with the mean of the existing embeddings
        input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings_data[-num_new_tokens:] = input_embeddings_avg
        output_embeddings_data[-num_new_tokens:] = output_embeddings_avg
pass
asphytheghoul commented 5 months ago

@danielhanchen Thank you for the quick response! Should this function be called on the model and tokenizer before patching it with the LoRA adapters, or after? i.e. like this:

import transformers
def smart_tokenizer_and_embedding_resize(
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    model.resize_token_embeddings(len(tokenizer))
    num_new_tokens = 15937

    if num_new_tokens > 0:
        input_embeddings_data = model.get_input_embeddings().weight.data
        output_embeddings_data = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings_data[-num_new_tokens:] = input_embeddings_avg
        output_embeddings_data[-num_new_tokens:] = output_embeddings_avg
    print("Done!")

smart_tokenizer_and_embedding_resize(tokenizer,model)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

or like this:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

def smart_tokenizer_and_embedding_resize(
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    model.resize_token_embeddings(len(tokenizer))
    num_new_tokens = 15937

    if num_new_tokens > 0:
        input_embeddings_data = model.get_input_embeddings().weight.data
        output_embeddings_data = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings_data[-num_new_tokens:] = input_embeddings_avg
        output_embeddings_data[-num_new_tokens:] = output_embeddings_avg
    print("Done!")

smart_tokenizer_and_embedding_resize(tokenizer,model)

Thanks

danielhanchen commented 5 months ago

The first one should be correct, i.e.:

model, tokenizer = FastLanguageModel.from_pretrained(...)
edit_tokenizer(tokenizer)
smart_tokenizer_and_embedding_resize(tokenizer, model)
model = FastLanguageModel.get_peft_model(...)
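
Spelled out with the argument values used earlier in this thread (a sketch only - "MY_TOKENIZER" is a placeholder for your extended tokenizer path, the LoRA settings are whatever you actually use, and the two-argument smart_tokenizer_and_embedding_resize variant from the question above is assumed):

from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# Load the 4-bit base model first...
model, _ = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/llama-2-7b-bnb-4bit",
    max_seq_length = 2048,
    dtype          = None,
    load_in_4bit   = True,
)
# ...and the extended-vocabulary tokenizer
tokenizer = AutoTokenizer.from_pretrained("MY_TOKENIZER")

# 1) Resize the embeddings and lm_head to the new vocabulary
smart_tokenizer_and_embedding_resize(tokenizer, model)

# 2) Only then attach the LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
)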
asphytheghoul commented 5 months ago

Hello @danielhanchen, I tried your suggestion and unfortunately I still get errors, but I have understood the problem. When I save the trained adapters using

model.save_pretrained("name_of_model")
tokenizer.save_pretrained("name_of_model")

and try to load them again using:

from unsloth import FastLanguageModel
model,tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./name_of_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

the error stems from the fact that Unsloth looks at the adapter_config.json file and its base_model_name_or_path key, whose value is unsloth/llama-2-7b-bnb-4bit. So it tries to apply the adapters onto the base Llama-2 model, which has an embedding size of (32000, 4096). That's the main cause of the error:

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([47943, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
    size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([47943, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
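
(For reference, you can confirm which base model the adapters will be resolved against by reading that key directly - illustrative snippet, using the adapter directory saved above:)

import json

with open("./name_of_model/adapter_config.json") as f:
    adapter_cfg = json.load(f)

print(adapter_cfg["base_model_name_or_path"])  # -> "unsloth/llama-2-7b-bnb-4bit"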

How do you suggest I proceed? Thanks

danielhanchen commented 5 months ago

@asphytheghoul If you are primarily using it for inference, I suggest using HF's general loading mechanisms for now - I don't think I can support expanded vocabs via FastLanguageModel yet - I'll add a fix, maybe in the next few days - but until then the quick fix is to load through plain HF. Sorry the issue is there though!

asphytheghoul commented 5 months ago

@danielhanchen Hello! I have found a solution to this problem. If anyone else is facing this, it is an expected situation and not an issue with Unsloth in any way. The reason it happens is that you load the base model (for example Llama-2), resize the token embeddings, and proceed to fine-tune the model on your data. Once you have finished fine-tuning, you save the adapters. That works fine so far, because you trained the LoRA adapters against the resized embeddings of your extended-vocabulary tokenizer. The problem appears when you try to load them again, because the adapters were saved against the base LLaMA-2 model configuration: if you inspect the adapter_config.json file, you will find the base_model_name_or_path key holding the value of the base model you used while fine-tuning (in this case, meta-llama/Llama-2-7b-hf). The loader therefore looks at the Llama-2 config, completely ignores the fact that you had resized the embeddings, and tries to load your adapters onto that model, which results in the following error:

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([47943, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
    size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([47943, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

The solution to this problem is to merge the trained LoRA adapters back into the (embedding-resized) base model and save the merged model in 16-bit, then load that merged checkpoint directly instead of the adapters.

Note: this might not be the only solution, but it's the workaround I explored and found to work for my case.
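
A sketch of that merge-to-16-bit workaround (assumptions: the base model is reloaded unquantized in fp16 through plain transformers/PEFT rather than FastLanguageModel, and the adapter/tokenizer directory is the one saved above - adjust paths, and pass a token for the gated Llama-2 repo if needed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the unquantized base model in 16-bit and the extended-vocabulary tokenizer
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # gated repo: pass token="hf_..." if required
    torch_dtype = torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("./name_of_model")

# Resize embeddings + lm_head so the shapes match the trained adapters (47943 here)
base.resize_token_embeddings(len(tokenizer))

# Attach the trained LoRA adapters, then fold them into the base weights
model = PeftModel.from_pretrained(base, "./name_of_model")
model = model.merge_and_unload()

# Save the merged 16-bit checkpoint; it loads directly, without any adapter_config.json
model.save_pretrained("./name_of_model-merged-16bit")
tokenizer.save_pretrained("./name_of_model-merged-16bit")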

Thanks!

danielhanchen commented 5 months ago

@asphytheghoul Oh yep, great point / solution on merging the model to 16-bit :) Not sure why I didn't mention that, whoops :) But super glad you got it to work in the end!