unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Are there any guidelines for loading a CPT (continued pre-training) model and retraining it on a different data set? #1090

Closed. daegonYu closed this issue 2 weeks ago.

daegonYu commented 1 month ago

Can I load a model trained with Unsloth's CPT (continued pre-training) method, make only the saved LoRA parameters trainable again, and then continue CPT on a different dataset? In other words, I want to keep training the LoRA parameters of the CPT-trained model on a new dataset. Are there any reference documents or guidelines for this? If I run the code below to continue CPT on a different dataset, won't a second set of LoRA layers be created on top of the existing ones? I want to reuse the LoRA layers created in the previous CPT step as they are.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_cpt_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",],  # embed_tokens & lm_head added for continued pretraining
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)
danielhanchen commented 1 month ago

@daegonYu Yes, that should work (I think). The continued pretraining notebook does train on the same LoRA adapters twice - https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing - so it should function (hopefully).

daegonYu commented 1 month ago

So if I load the saved LoRA model directly and train it with UnslothTrainer, without calling get_peft_model() again, training continues with the previously created LoRA parameters. Thank you for your answer.
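In case it helps others, here is a minimal sketch of what I mean, assuming the earlier CPT run saved its LoRA adapters to a directory called "my_cpt_model"; the path, the dataset variable new_dataset, and the hyperparameters are just placeholders:

from unsloth import FastLanguageModel, UnslothTrainer, UnslothTrainingArguments

# Loading the saved LoRA checkpoint directly restores the existing adapters,
# so get_peft_model() is NOT called again and no second set of LoRA layers is added.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_cpt_model",      # placeholder: path to the previous CPT LoRA checkpoint
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = new_dataset,      # placeholder: the different dataset for the second CPT stage
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,  # smaller LR for embed_tokens / lm_head, as in the CPT notebook
        output_dir = "outputs",
    ),
)
trainer.train()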

daegonYu commented 1 month ago

Additionally, I have a question. My understanding is that when training a decoder model on instruction data, the instruction part is fed to the model but the loss is computed only on the response part. In the Colab you suggested, however, the data is used for training without that distinction. Can the model still learn effectively when trained this way? Also, could you point me to a blog post or paper that explains this?

danielhanchen commented 1 month ago

@daegonYu You might be interested in our conversational notebook which masks out the instruction - https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing

Also see https://github.com/unslothai/unsloth/wiki#train-on-completions--responses-only-do-not-train-on-inputs
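Under the hood, "train on responses only" just means the label ids for the instruction tokens are set to -100 (PyTorch's cross-entropy ignore_index), so only the response tokens contribute to the loss. A rough standalone sketch of the idea, simplified; the real collators also handle padding, special tokens, and chat templates:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")  # any tokenizer; opt-350m is only an example

prompt   = "### Question: What is 2 + 2?\n ### Answer:"
response = " 4"

prompt_ids   = tokenizer(prompt,   add_special_tokens = False)["input_ids"]
response_ids = tokenizer(response, add_special_tokens = False)["input_ids"]

# The model still sees the full sequence as input ...
input_ids = prompt_ids + response_ids
# ... but the prompt positions are masked out of the loss with -100,
# so gradients only flow from predicting the response tokens.
labels = [-100] * len(prompt_ids) + response_ids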

daegonYu commented 1 month ago

Oh, this is what I was looking for. Thank you!

daegonYu commented 1 month ago

One thing I'm wondering about while researching this: is it correct to assume that using DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer) has the same effect as using DataCollatorForSeq2Seq(tokenizer=tokenizer) together with

train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Here's a more detailed code example:


# Approach 1: TRL's DataCollatorForCompletionOnlyLM masks everything before the response template.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=SFTConfig(output_dir="/tmp"),
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)

# Approach 2: Unsloth's train_on_responses_only with DataCollatorForSeq2Seq.
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

max_seq_length = 2048  # example value; not defined elsewhere in this snippet

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    # instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
danielhanchen commented 1 month ago

@daegonYu Sorry for the delay! Yes, they're equivalent EXCEPT when you're doing more than one conversation - HF's collator does not support that, whilst Unsloth's does.

Candice1995 commented 4 weeks ago

May I ask a question about max_seq_length during DPO training? When initializing the model from the SFT checkpoint as follows:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = f"{args.ckpt_name}",
    max_seq_length = max_seq_length,   # here max_seq_length = 4096
)

why is max_seq_length = 4096, while in the SFT trainer this argument was 2048? What is the relation between max_seq_length and the arguments used when initializing the DPO trainer, e.g. max_length and max_prompt_length = prompt_length?

danielhanchen commented 3 weeks ago

@Candice1995 Apologies for the delay. A DPO example has a prompt plus two other fields: the accepted (chosen) answer and the rejected answer to that prompt. These fields have varying lengths, so we have to truncate them or specify a maximum length for each. Unsloth's max_seq_length is the maximum total length summed over all of the fields.
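For illustration, a rough sketch of how these lengths fit together, assuming an older TRL version where max_length and max_prompt_length are passed straight to DPOTrainer (newer TRL versions move them onto DPOConfig); the checkpoint path, dataset variable dpo_dataset, and numbers are placeholders:

from unsloth import FastLanguageModel, PatchDPOTrainer
from transformers import TrainingArguments
from trl import DPOTrainer

PatchDPOTrainer()  # apply Unsloth's DPO patches before building the trainer

max_seq_length = 4096  # model-level budget: prompt + chosen/rejected answer must fit in this

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_sft_checkpoint",   # placeholder for the SFT LoRA checkpoint
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 5e-6,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dpo_dataset,        # placeholder: columns prompt / chosen / rejected
    tokenizer = tokenizer,
    max_length = 4096,                  # prompt + answer after truncation
    max_prompt_length = 2048,           # prompt alone after truncation
)

# Roughly: max_prompt_length <= max_length <= max_seq_length, so the SFT value (2048)
# can differ from the DPO value (4096) as long as the combined fields still fit.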