huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Using Pipeline and TGI having the LoRA Adapter. #1468

Closed: wolfassi123 closed this issue 5 months ago

wolfassi123 commented 5 months ago

System Info

Python 3.10.12 peft @ git+https://github.com/huggingface/peft.git@25dec602f306d52b6cc078ec8353ba6eac249097 transformers @ git+https://github.com/huggingface/transformers.git@8a0ed0a9a2ee8712b2e2c3b20da2887ef7c55fe6 accelerate==0.27.2

Who can help?

No response

Information

Tasks

Reproduction

from typing import Optional

import torch

from transformers.utils import is_accelerate_available, is_bitsandbytes_available
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    GenerationConfig,
    pipeline,
)
from peft import PeftModel

hf_write_token = "hf_testtoken"

def load_adapted_hf_generation_pipeline(
    base_model_name,
    lora_model_name,
    temperature: float = 0,
    top_p: float = 1.,
    max_tokens: int = 64,
    batch_size: int = 2,
    device: str = "cuda",
    load_in_8bit: bool = False,
    generation_kwargs: Optional[dict] = None,
):

    if device == "cuda" and not is_accelerate_available():
        raise ValueError("Install `accelerate`")
    if load_in_8bit and not is_bitsandbytes_available():
        raise ValueError("Install `bitsandbytes`")

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    task = "text2text-generation"

    if device == "cuda":
        model = AutoModelForSeq2SeqLM.from_pretrained(
            base_model_name,
            load_in_8bit=load_in_8bit,
            torch_dtype=torch.float16,
            device_map="auto",
            token=hf_write_token,
        )
        # Attach the LoRA adapter to the base model that was just loaded.
        model = PeftModel.from_pretrained(
            model,
            lora_model_name,
            torch_dtype=torch.float16,
            token=hf_write_token,
        )
    elif device == "mps":
        model = AutoModelForSeq2SeqLM.from_pretrained(
            base_model_name,
            device_map={"": device},
            torch_dtype=torch.float16,
            token=hf_write_token,
        )
        model = PeftModel.from_pretrained(
            model,
            lora_model_name,
            device_map={"": device},
            torch_dtype=torch.float16,
            token=hf_write_token,
        )
    else:
        model = AutoModelForSeq2SeqLM.from_pretrained(
            base_model_name, device_map={"": device}, low_cpu_mem_usage=True, token=hf_write_token,
        )
        model = PeftModel.from_pretrained(
            model,
            lora_model_name,
            device_map={"": device},
            token=hf_write_token,
        )

    # The gist this is based on hardcoded LLaMA token ids (pad=0, bos=1, eos=2);
    # those do not apply to T5, so keep the ids that ship with the tokenizer/model.
    model.config.pad_token_id = tokenizer.pad_token_id

    if not load_in_8bit and device == "cuda":
        model.half()

    model.eval()

    generation_kwargs = generation_kwargs if generation_kwargs is not None else {}
    config = GenerationConfig(
        # Sample only for a positive temperature; temperature=0 with
        # do_sample=True raises an error at generation time.
        do_sample=temperature > 0,
        temperature=temperature,
        max_new_tokens=max_tokens,
        top_p=top_p,
        **generation_kwargs,
    )
    pipe = pipeline(
        task,
        model=model,
        tokenizer=tokenizer,
        batch_size=batch_size,
        generation_config=config,
        framework="pt",
    )

    return pipe

if __name__ == "__main__":
    pipe = load_adapted_hf_generation_pipeline(
        base_model_name="google/flan-t5-large",
        lora_model_name="test/test",
    )

    # Placeholder inputs; `conversations` was defined in the training notebook.
    conversations = ["example input 1", "example input 2"]
    print(pipe(conversations))

Expected behavior

I had just trained my first LoRA model, but I believe I might have missed something. After training a Flan-T5-Large model, I tested it and it worked perfectly when I decoded the outputs with the following bit of code:

# `model`, `tokenizer`, and `conversations` are the ones from the training run.
model.eval()

for i, text in enumerate(conversations):
    print(f"{i}: {text}")
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=64)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

I then decided to test its deployment with TGI. Deploying the base Flan-T5-Large model from Google on TGI was pretty straightforward. But when I tested my LoRA model through pipeline, it underperformed heavily. I simply loaded the checkpoint with pipeline and set the task to "text2text-generation".

I noticed that when I trained my LoRA model I did not get a "config.json" file, only an "adapter_config.json" file, so what I actually have is just the adapter. I don't know if that is one of the reasons; after training I read more about LoRA and saw that the documentation mentions "merging" and "loading" the adapter into the base model, which I did not do at the start. I basically trained and got a checkpoint per epoch, tested the checkpoint with the best metrics, and pushed it to my private hub. These are the files that I have pushed to my hub:

Without re-training, how can I load the LoRA model properly so I can test it with pipeline and ultimately deploy it with TGI? In short, I want to query the model I got from the LoRA adapter through pipeline, and then serve it with TGI.

N.B.: The code I used for the load_adapted_hf_generation_pipeline function was inspired by the following GitHub gist: https://gist.github.com/ahoho/ba41c42984faf64bf4302b2b1cd7e0ce
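For reference, here is a minimal sketch (not from the issue, just an illustration) of how an adapter-only checkpoint, i.e. one that ships adapter_config.json instead of config.json, can be attached to its base model and queried through pipeline. The adapter id "test/test" and the prompt are placeholders:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from peft import PeftModel

base_id = "google/flan-t5-large"
adapter_id = "test/test"  # placeholder: your adapter repo on the Hub

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForSeq2SeqLM.from_pretrained(base_id, device_map="auto")

# PeftModel.from_pretrained only pulls the adapter weights; the base model
# loaded above provides config.json and the full set of weights.
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
print(pipe("Summarize: the quick brown fox jumps over the lazy dog.", max_new_tokens=64))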

BenjaminBossan commented 5 months ago

the model underperformed heavily

Could you be more precise? Do you mean that the results were random, the same as just the base model without fine-tuning, or somewhere in between?

One thing you could test would be to merge the LoRA weights into the base model before saving it:

model_merged = model.merge_and_unload()
model_merged.save_pretrained(...)

This should give you the full model, including the config.json. When you load this model into TGI, do you get results on par with your expectations?
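To expand on that suggestion, a rough sketch of the whole merge-and-save flow, assuming the adapter lives at "test/test" and the merged checkpoint should end up in a local flan-t5-large-merged directory (both names are placeholders):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

base_id = "google/flan-t5-large"
adapter_id = "test/test"             # placeholder: your adapter repo
output_dir = "flan-t5-large-merged"  # placeholder: where the merged model goes

# Load the base model and attach the LoRA adapter.
base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_id)

# Bake the LoRA weights into the base weights and drop the adapter layers.
merged = model.merge_and_unload()

# Save a full standalone checkpoint (config.json + weights) plus the tokenizer.
merged.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)

# Optionally push it to the Hub for TGI to pull from:
# merged.push_to_hub("your-username/flan-t5-large-merged", private=True)

The merged directory then looks like any other Flan-T5 checkpoint, so the deployment path that already worked for the base model with pipeline and TGI should apply unchanged.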

wolfassi123 commented 5 months ago

Yep, that was it. Using model.merge_and_unload() I managed to merge the LoRA adapter with the base model and was able to query the model through pipeline correctly. The performance was the same as when I first evaluated the LoRA-adapted model after training. Apparently I had earlier been loading only the adapter and nothing else, which explains the performance drop.
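As an aside, a small sanity-check sketch (not from the thread) for confirming that a merged checkpoint reproduces the adapter-wrapped model's generations before the adapter-only repo is retired; paths and the prompt are placeholders:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

base_id = "google/flan-t5-large"
adapter_id = "test/test"             # placeholder: adapter repo
merged_dir = "flan-t5-large-merged"  # placeholder: merged checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_id)
adapted = PeftModel.from_pretrained(AutoModelForSeq2SeqLM.from_pretrained(base_id), adapter_id).eval()
merged = AutoModelForSeq2SeqLM.from_pretrained(merged_dir).eval()

inputs = tokenizer("Summarize: the quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out_a = adapted.generate(**inputs, max_new_tokens=64, do_sample=False)
    out_m = merged.generate(**inputs, max_new_tokens=64, do_sample=False)

# With greedy decoding and matching dtypes, the two outputs should be identical.
print(tokenizer.batch_decode(out_a, skip_special_tokens=True))
print(tokenizer.batch_decode(out_m, skip_special_tokens=True))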

BenjaminBossan commented 5 months ago

Great, thanks for testing this. I'll close the issue then, feel free to re-open if something else comes up.