unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Fail to load checkpoints trained with extended tokenizer #1215

Open AbnetS opened 2 weeks ago

AbnetS commented 2 weeks ago

As discussed in issue https://github.com/unslothai/unsloth/issues/154#issue-2119969174, I am also working with an extended tokenizer to accommodate words of a new language. I merged the Llama 3.2 tokenizer with my tokenizer, which increased its size to 146,452 (as opposed to 128,256, the size of the original Llama 3.2 tokenizer). I am running continual pretraining and saving checkpoints every certain number of steps. I want to finetune the checkpoints further with an instruction dataset to track their performance. However, I am not able to load the checkpoints due to the mismatch between the tokenizer sizes of the base model and the adapter. I read about the suggested solution: merge and save the checkpoints. However, since unsloth saves the checkpoints automatically, I don't have the chance to do that without first loading the models. So, what should I do? Any suggestion is appreciated!
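
For reference, this is roughly the load I would like to work (just a sketch; the loader also accepts a `resize_model_vocab` argument, but I have not confirmed that it fixes the mismatch):

```python
from unsloth import FastLanguageModel

# Sketch only: point the loader at a saved checkpoint and ask it to resize the base
# model's vocabulary to the extended tokenizer size before the adapter is attached.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/path/to/checkpoint-5000",  # an adapter checkpoint saved during pretraining
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = True,
    resize_model_vocab = 146452,  # extended vocab; the stock Llama 3.2 tokenizer has 128,256 entries
)
```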

Erland366 commented 2 weeks ago

Are you using the unsloth add_new_token function?

```python
from unsloth import add_new_tokens
add_new_tokens(model, tokenizer, new_tokens = ["<SPECIAL_TOKEN_1>", "<SPECIAL_TOKEN_2>"])
```

I tried this and it works + I can use trainer.train(resume_from_checkpoint = True) too.

Don't forget to use this before FastLanguageModel.get_peft_model .-.
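
To be concrete, the order I mean is roughly this (a sketch based on your snippets, not tested against your exact setup):

```python
from unsloth import FastLanguageModel, add_new_tokens

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-bnb-4bit",
    max_seq_length = 4096,
    load_in_4bit = True,
)

# 1) extend the vocabulary BEFORE creating the LoRA adapters
add_new_tokens(model, tokenizer, new_tokens = ["<SPECIAL_TOKEN_1>", "<SPECIAL_TOKEN_2>"])

# 2) only then wrap the model with PEFT (include embed_tokens / lm_head for continual pretraining)
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],
)

# 3) training, and resuming, works as usual afterwards
# trainer.train(resume_from_checkpoint = True)
```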

Erland366 commented 2 weeks ago

I might have reproduced your problem, please let me know if this is the bug or not .-.

[screenshot of the reproduced error]

So what I did here: after the checkpoint was saved, I loaded the model but didn't run add_new_tokens again. That might be the problem? .-.

AbnetS commented 2 weeks ago

Thanks @Erland366 for the replies and the suggestions.

The error is exactly that. To answer your questions and explain what I was trying to do, here are some code snippets:

  1. I separately trained a tokenizer ("am1_tokenizer") for my local language with SentencePiece. To merge it into the Llama 3.2 tokenizer ("tokenizer"), I used the following:

```python
from tqdm import tqdm
from transformers import AddedToken

# am1_tokenizer holds the SentencePiece pieces of the new-language tokenizer;
# am1 is the SentencePiece processor used to decode each piece back to plain text
for p in tqdm(am1_tokenizer.pieces):
    tokenizer.add_tokens(AddedToken(am1.decode(p.piece), normalized=False, special=False))
tokenizer.save_pretrained("amh_custom_tokenizer")
```
  2. I ran continued pretraining on a text dataset for language adaptation:

    
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("amh_custom_tokenizer") #_3

model, _ = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
model.resize_token_embeddings(len(tokenizer))

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 4,  #2  #4
        gradient_accumulation_steps = 16, #8  #16

        #max_steps = 2000,
        warmup_ratio = 0.1,
        num_train_epochs = 1,

        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "models/llama3.2_amh_19m_3",
        save_strategy = "steps",
        save_steps = 5000,
    ),
)

from unsloth import unsloth_train
trainer_stats = trainer.train()
```

3. So far, so good. The training ran without problems, saving checkpoints every 5000 steps automatically. Even when the training is interrupted, resume_from_checkpoint = True works well, as you said.
4. But while the continued pretraining is running, I also want to finetune the saved checkpoints further with an instruction dataset (a different dataset) for a downstream task, so I tried to load one of them in another notebook as follows:
```python
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # 2048 Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model = "/path/to/checkpoint-5000"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
```

```
Unsloth: Tokenizer is most likely buggy, and Unsloth failed to repair it.
It will still work, but beware of out of bounds memory accesses.
Please file an issue on the model owner's repo about this issue.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 9
      5 load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
      7 model = "/home/abnets/unsloth/models/llama3.2_amh_19m_2/checkpoint-1000"
----> 9 model, tokenizer = FastLanguageModel.from_pretrained(
     10     model_name = model,
     11     max_seq_length = max_seq_length,
     12     dtype = dtype,
     13     load_in_4bit = load_in_4bit,
     14    # resize_model_vocab = 146452
     15     # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
     16 )

File ~/unsloth/unsloth_env/lib/python3.10/site-packages/unsloth/models/loader.py:383, in FastLanguageModel.from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, token, device_map, rope_scaling, fix_tokenizer, trust_remote_code, use_gradient_checkpointing, resize_model_vocab, revision, *args, **kwargs)
    379 if is_peft:
    380     # From https://github.com/huggingface/peft/issues/184
    381     # Now add PEFT adapters
    382     model.enable_input_require_grads()
--> 383     model = PeftModel.from_pretrained(
    384         model,
    385         old_model_name,
    386         token = token,
    387         revision = revision,
    388         is_trainable = True,
    389         trust_remote_code = trust_remote_code,
    390     )
    391     # Patch it as well!
    392     model = dispatch_model.patch_peft_model(model, use_gradient_checkpointing)

File ~/unsloth/unsloth_env/lib/python3.10/site-packages/peft/peft_model.py:586, in PeftModel.from_pretrained(cls, model, model_id, adapter_name, is_trainable, config, autocast_adapter_dtype, ephemeral_gpu_offload, low_cpu_mem_usage, **kwargs)
    577 else:
    578     model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](
    579         model,
    580         config,
   (...)
    583         low_cpu_mem_usage=low_cpu_mem_usage,
    584     )
--> 586 model.load_adapter(
    587     model_id,
    588     adapter_name,
    589     is_trainable=is_trainable,
    590     autocast_adapter_dtype=autocast_adapter_dtype,
    591     low_cpu_mem_usage=low_cpu_mem_usage,
    592     **kwargs,
    593 )
    595 return model

File ~/unsloth/unsloth_env/lib/python3.10/site-packages/peft/peft_model.py:1181, in PeftModel.load_adapter(self, model_id, adapter_name, is_trainable, torch_device, autocast_adapter_dtype, ephemeral_gpu_offload, low_cpu_mem_usage, **kwargs)
   1179 # load the weights into the model
   1180 ignore_mismatched_sizes = kwargs.get("ignore_mismatched_sizes", False)
-> 1181 load_result = set_peft_model_state_dict(
   1182     self,
   1183     adapters_weights,
   1184     adapter_name=adapter_name,
   1185     ignore_mismatched_sizes=ignore_mismatched_sizes,
   1186     low_cpu_mem_usage=low_cpu_mem_usage,
   1187 )
   1188 if (
   1189     (getattr(self, "hf_device_map", None) is not None)
   1190     and (len(set(self.hf_device_map.values()).intersection({"cpu", "disk"})) > 0)
   1191     and len(self.peft_config) == 1
   1192 ):
   1193     device_map = kwargs.get("device_map", "auto")

File ~/unsloth/unsloth_env/lib/python3.10/site-packages/peft/utils/save_and_load.py:464, in set_peft_model_state_dict(model, peft_model_state_dict, adapter_name, ignore_mismatched_sizes, low_cpu_mem_usage)
    462             module._move_adapter_to_device_of_base_layer(adapter_name)
    463 else:
--> 464     load_result = model.load_state_dict(peft_model_state_dict, strict=False)
    466 if config.is_prompt_learning:
    467     model.prompt_encoder[adapter_name].embedding.load_state_dict(
    468         {"weight": peft_model_state_dict["prompt_embeddings"]}, strict=True
    469     )

File ~/unsloth/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:2584, in Module.load_state_dict(self, state_dict, strict, assign)
   2576         error_msgs.insert(
   2577             0,
   2578             "Missing key(s) in state_dict: {}. ".format(
   2579                 ", ".join(f'"{k}"' for k in missing_keys)
   2580             ),
   2581         )
   2583 if len(error_msgs) > 0:
-> 2584     raise RuntimeError(
   2585         "Error(s) in loading state_dict for {}:\n\t{}".format(
   2586             self.__class__.__name__, "\n\t".join(error_msgs)
   2587         )
   2588     )
   2589 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.embed_tokens.modules_to_save.default.weight: copying a param with shape torch.Size([146452, 3072]) from checkpoint, the shape in current model is torch.Size([128256, 3072]).
    size mismatch for base_model.model.lm_head.modules_to_save.default.weight: copying a param with shape torch.Size([146452, 3072]) from checkpoint, the shape in current model is torch.Size([128256, 3072]).
```
  5. This is happening when I try to load the model with FastLanguageModel.from_pretrained. If the checkpoints had been merged with the base model before the trainer saved them automatically, this problem wouldn't have happened. Merging solves the problem, as indicated in https://github.com/unslothai/unsloth/issues/154
  6. As a workaround, I am running the continued pretraining only for some number of steps (e.g. 5000, not a full epoch), so that I get the chance to merge and save the model, making it ready for further finetuning, and then continue the continual pretraining from where I stopped for another number of steps to get the next checkpoint (see the sketch after this list).
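
This is the sketch I mean for the workaround (assuming `save_pretrained_merged` behaves here as it does for normal saving; I have not verified the merged weights in detail):

```python
# Run pretraining for a limited number of steps, then stop while the model (with its
# extended vocabulary) is still in memory.
trainer_stats = trainer.train()

# Merge the LoRA adapter (including the resized embed_tokens / lm_head) into the base
# weights and save them together with the extended tokenizer.
model.save_pretrained_merged(
    "models/llama3.2_amh_19m_3_merged",  # hypothetical output folder
    tokenizer,
    save_method = "merged_16bit",
)

# The merged folder can then be loaded with FastLanguageModel.from_pretrained for the
# instruction finetuning, while pretraining itself is resumed separately with
# trainer.train(resume_from_checkpoint = True).
```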

I hope that is clear. Please let me know if I am doing something wrong, or if there is a way to automatically save the merged checkpoint.

Erland366 commented 2 weeks ago

Oh yeah, I think I can implement it
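
In the meantime, a callback that writes a merged copy next to each checkpoint might work as a stopgap (rough, untested sketch; `SaveMergedCallback` is just a hypothetical name):

```python
import os
from transformers import TrainerCallback

class SaveMergedCallback(TrainerCallback):
    """Writes a merged copy of the model every time the trainer saves a checkpoint."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def on_save(self, args, state, control, **kwargs):
        merged_dir = os.path.join(args.output_dir, f"merged-{state.global_step}")
        # Merge the LoRA weights (and the resized embeddings) into the base model and
        # save them together with the extended tokenizer.
        self.model.save_pretrained_merged(merged_dir, self.tokenizer, save_method = "merged_16bit")

# usage: trainer.add_callback(SaveMergedCallback(model, tokenizer))
```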