huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Resuming / retraining the model #593

Closed adityaaryan77 closed 1 year ago

adityaaryan77 commented 1 year ago

System Info

#574 Although resume_from_checkpoint now works after @llohann-speranca solved that issue, fine-tuning again with new data using train(resume_from_checkpoint=True) and then testing makes the model forget the old data, i.e. it won't remember the things in the old dataset.

Attaching the code below:

import json
import os
import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from peft import ( 
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

MODEL_NAME = "tiiuae/falcon-7b"
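# Quantize to 4-bit NF4 with double quantization; compute dtype is bfloat16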
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
    r=16, 
    lora_alpha=32, 
    target_modules=["query_key_value"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
print_trainable_parameters(model)

data = load_dataset("json", data_files="../localGPT/output.json")

def generate_prompt(data_point):
    return f"""
<human>: {data_point["question"]}
<assistant>: {data_point["answer"]}
""".strip()

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = data["train"].shuffle().map(generate_and_tokenize_prompt)

OUTPUT_DIR = "outputs"
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,  # ignored here: max_steps takes precedence
    warmup_ratio=0.05,
    max_steps=80,
    learning_rate=2e-4,
    fp16=True,  # note: bnb_4bit_compute_dtype above is bfloat16; bf16=True would be consistent
    logging_steps=1,
    save_total_limit=3,
    output_dir=OUTPUT_DIR,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing
trainer.train(resume_from_checkpoint=True)  # resumes from the latest checkpoint in OUTPUT_DIR; fails if none exists
trainer.save_model(os.path.join(OUTPUT_DIR, "checkpoint-2"))  # saves the LoRA adapter and its config

PEFT_MODEL = os.path.join(OUTPUT_DIR, "checkpoint-2")
config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token
model = PeftModel.from_pretrained(model, PEFT_MODEL)
generation_config = model.generation_config
generation_config.max_new_tokens = 20
generation_config.temperature = 0  # has no effect under greedy decoding (do_sample defaults to False)
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

DEVICE = "cuda:0"

prompt = """
<human>:What is my cat's name?
<assistant>:
""".strip()

encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    ) 
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Who can help?

@younesbelkada @pacman100

Reproduction

I trained my model on my cat's name in the first iteration and saved it in checkpoint-1, then retrained it on my dog's name. Although it now knows my dog's name, it forgets my cat's name.

Expected behavior

To remember my cat's name.

younesbelkada commented 1 year ago

hi @adityaaryan77 Thinking a bit about your issue (if I understood it correctly), I think this is expected. You start with a fine-tuned adapter and fine-tune it on a new dataset, so the model forgets its knowledge of the first task. You might need to create a new adapter for your new task. Also, your code is a bit hard to read for me; could you format it properly with a code block and correct indentation? Let me know if I got the problem right.
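
A minimal sketch of that one-adapter-per-task approach (the adapter paths and the base_model variable here are hypothetical; PeftModel.from_pretrained, load_adapter, and set_adapter are the relevant PEFT calls):

# Sketch: keep one LoRA adapter per task and switch between them instead of
# overwriting a single adapter. Paths are placeholders.
from peft import PeftModel

# base_model: the quantized Falcon model, loaded as in the script above
model = PeftModel.from_pretrained(base_model, "outputs/cats-adapter", adapter_name="cats")
model.load_adapter("outputs/dogs-adapter", adapter_name="dogs")

model.set_adapter("cats")  # activate the adapter trained on the first dataset
model.set_adapter("dogs")  # switch to the adapter trained on the second dataset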

adityaaryan77 commented 1 year ago

Hi @younesbelkada No. After training it on a dataset I save the model; I then want to fine-tune it further on a new dataset, but I want the model to remember the old material. Otherwise, what would be the point of saving it and training it again? I could just train the base model again. Sure, I am attaching the code in a pastebin here: https://pastebin.pl/view/4e77a13d
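
One standard mitigation for this kind of forgetting (not suggested in this thread) is simple data replay: mix the old dataset, or a sample of it, into the new one before the second fine-tuning run. A sketch, with hypothetical file names standing in for the two datasets:

# Replay sketch: concatenate old and new data so the adapter keeps seeing
# the first dataset while learning the second. File names are placeholders.
from datasets import load_dataset, concatenate_datasets

old_data = load_dataset("json", data_files="old_output.json")["train"]
new_data = load_dataset("json", data_files="new_output.json")["train"]
data = concatenate_datasets([old_data, new_data]).shuffle(seed=42)
data = data.map(generate_and_tokenize_prompt)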

adityaaryan77 commented 1 year ago

@younesbelkada any idea how to fix this?

younesbelkada commented 1 year ago

Hi @adityaaryan77 Thanks for the ping. I am unsure how to solve this issue. As a sanity check, can you try to generate some text right after training and before saving the model (e.g. trainer.model.generate(xxx))?
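
Concretely, that sanity check might look like this (a sketch reusing the tokenizer and prompt format from the script above):

# Generate from the live trainer model right after trainer.train() and before
# saving, to separate training-time forgetting from save/reload problems.
trainer.model.eval()
prompt = "<human>: What is my cat's name?\n<assistant>:"
encoding = tokenizer(prompt, return_tensors="pt").to(trainer.model.device)
with torch.inference_mode():
    outputs = trainer.model.generate(**encoding, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))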

adityaaryan77 commented 1 year ago

Hey @younesbelkada, for example: I train it with the prompt "What's my cat's name?" and the answer "Tom", and it will reply "Tom" to that question. But after training it on the prompt "What's my dog's name?" with the answer "Bob", if I ask what my dog's name is it replies "Bob", and if I ask what my cat's name is it also replies "Bob", or sometimes just repeats the question.

brando90 commented 1 year ago

@adityaaryan77 can you paste your error? One has to read your entire conversation and question to figure out what your issue is. The title doesn't describe or mention the specific issue either.

brando90 commented 1 year ago

perhaps related: https://github.com/huggingface/peft/issues/685. Somehow the author of this issue is able to train at all, whereas I fail.

brando90 commented 1 year ago

original falcon qlora: https://gist.github.com/pacman100/1731b41f7a90a87b457e8c5415ff1c14

imarquart commented 1 year ago

This may very well be correct behavior, depending on your data and your training. If you train, both times, on a single example text for the usual number of steps, the model will likely reply with the most recently trained name to any similar question. Essentially, you have induced catastrophic forgetting.

To diagnose this, it would help to output the logits, rank them, and print the top-k token IDs. Then, over the course of fine-tuning, see how these logits change.
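
A sketch of that diagnostic, reusing the model and tokenizer from the evaluation section above (k=10 is an arbitrary choice):

# Rank the next-token logits for a probe prompt and print the top-k candidates;
# re-running this over the course of fine-tuning shows how the distribution
# shifts toward the most recently trained answer.
encoding = tokenizer("<human>: What is my cat's name?\n<assistant>:", return_tensors="pt").to(model.device)
with torch.inference_mode():
    logits = model(**encoding).logits  # (batch, seq_len, vocab_size)
top = torch.topk(logits[0, -1], k=10)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {score.item():.2f}")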

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.