tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Inference Hangs #75

Open HaniItani opened 1 year ago

HaniItani commented 1 year ago

Hello,

Thank you for sharing your work.

I'm interested in evaluating alpaca-lora on QA tasks, starting with the BoolQ dataset. I followed the generate.py script and constructed a prompt that works for BoolQ. I'm currently running inference with batch_size=1. I noticed that inference sometimes hangs at random for more than 5 minutes on some samples without outputting any error. During these hangs, GPU memory usage climbs to as much as 50 GB on an A100, whereas normal inference uses 10-11 GB. I've seen this behavior on both A100 and V100 GPUs, and I made sure to use the latest version of the code. I've attached my code to reproduce the behavior:


```python
import os
import json
import torch
import transformers
from peft import PeftModel
from datasets import load_dataset
from tqdm.auto import tqdm

assert (
    "LlamaTokenizer" in transformers._import_structure["models.llama"]
), "LLaMA is now in HuggingFace's main branch.\nPlease reinstall it: pip uninstall transformers && pip install git+https://github.com/huggingface/transformers.git"
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder: set to your base LLaMA checkpoint
LORA_WEIGHTS = "tloen/alpaca-lora-7b"  # placeholder: set to your LoRA adapter weights

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

try:
    if torch.backends.mps.is_available():
        device = "mps"
except:
    pass

if device == "cuda":
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
elif device == "mps":
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL,
        device_map={"": device},
        torch_dtype=torch.float16,
    )
    model = PeftModel.from_pretrained(
        model,
        LORA_WEIGHTS,
        device_map={"": device},
        torch_dtype=torch.float16,
    )
else:
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL, device_map={"": device}, low_cpu_mem_usage=True
    )
    model = PeftModel.from_pretrained(
        model,
        LORA_WEIGHTS,
        device_map={"": device},
    )

def generate_prompt(instruction):
    return f"""I will give you a passage for context followed by a question. \n Here is the passage: \n {instruction["passage"]} \n 
    Answer by True or False: {instruction["question"]}?\n ### Response:"""

model.eval()
if torch.__version__ >= "2":
    model = torch.compile(model)

def evaluate(
    instruction,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    **kwargs,
):
    prompt = generate_prompt(instruction)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=2048,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    return output.split("### Response:")[1].strip()

predictions = []
gt = []
val_data = load_dataset("boolq", split="validation[:10%]")

for i, data in enumerate(tqdm(val_data)):
    print(i)
    gt.append(data["answer"])
    predictions.append(evaluate(data))

with open('predictions.json', 'w') as f:
    json.dump(predictions, f)

with open('gt.json', 'w') as f:
    json.dump(gt, f)
```
HideLord commented 1 year ago

Yeah, this seems like a general problem with max_new_tokens=2048. #73 had the same problem. Try reducing it.
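In the script above, that would mean lowering the cap passed to model.generate, roughly like this (a sketch; 32 is an arbitrary limit for a True/False task):

```python
# Cap generation length so a runaway decode cannot run on for thousands of tokens;
# 32 new tokens is more than enough for a True/False answer.
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=32,
)
```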

HaniItani commented 1 year ago

Thank you for your suggestion @HideLord. I decreased max_new_tokens to 10 since the task is boolean, and the behavior is less severe now, but still there. Do you know if this is a problem with alpaca-lora or with alpaca in general?

felri commented 1 year ago

@HaniItani I'm facing the same problem. I believe it's related to LoRA: if you remove `model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)`, prediction runs much faster. Maybe there are some steps we need to take, like merging the LoRA weights into the base model after fine-tuning, but don't take my word for it.
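If that's the missing step, the merge would look roughly like this (a sketch assuming a recent peft version that provides `merge_and_unload` and a base model loaded in fp16 rather than 8-bit; alpaca-lora's export_hf_checkpoint.py script does a similar merge):

```python
# Fold the LoRA deltas into the base weights so inference runs on a plain
# LlamaForCausalLM, with no PEFT wrapper in the forward pass.
base = LlamaForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
lora = PeftModel.from_pretrained(base, LORA_WEIGHTS, torch_dtype=torch.float16)
merged = lora.merge_and_unload()           # requires a non-8-bit base model
merged.save_pretrained("./merged-model")   # hypothetical output directory
```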

HaniItani commented 1 year ago

Thank you for your suggestion @felri. Initially I was testing the exported Hugging Face checkpoint and it was hanging; then I went back to the LoRA model and saw the same behavior.

HaniItani commented 1 year ago

@felri, I went back to max_new_tokens=512 but loaded the exported Hugging Face model without quantizing it to 8-bit. Inference still hangs sometimes and memory consumption goes up to 33 GB, but it's less severe than with 8-bit. However, if I load the LoRA model without 8-bit quantization, the problem becomes much more severe.

felri commented 1 year ago

Can someone with more knowledge confirm that merging the lora model could fix this issue? I think it's the missing piece here @HaniItani

HaniItani commented 1 year ago

@felri, it does not; I already tried it. I merged the LLaMA weights with the alpaca-lora weights and exported them to a Hugging Face model using the provided script. Same issue.

felri commented 1 year ago

@HaniItani did you try running it with llama.cpp or alpaca.cpp? I don't think it will matter that much, but maybe it will run faster? I'm running out of ideas; finetune.py and generate.py are pretty straightforward, so maybe it's something else.

HideLord commented 1 year ago

Did you guys try to stream the output? Maybe it's just never stopping for some reason.
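Streaming would print tokens as they are produced, so a never-ending generation becomes visible instead of looking like a hang. A rough sketch using the TextStreamer helper from newer transformers versions (not part of the script above):

```python
from transformers import TextStreamer

# Print each decoded token to stdout as soon as it is generated.
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    streamer=streamer,
    max_new_tokens=512,
)
```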

gururise commented 1 year ago

> Did you guys try to stream the output? Maybe it's just never stopping for some reason.

Interesting... During the hangs, I noticed GPU usage was at 90% and never went down. Maybe it never stops generating.

HaniItani commented 1 year ago

@HideLord, I'm not sure what you mean by streaming the output; can you please elaborate? The model seems to be stuck in the forward pass.

@gururise, yes, and the VRAM usage increases significantly too.

I tried inference using the regular alpaca repo and it seems to work fine, though slower than this implementation.

oplatek commented 1 year ago

I ran into a similar issue, and just by printing output = tokenizer.decode(s) I found that it never stops generating until it hits the max_new_tokens limit. Typically it generates the response correctly, but then repeats the instructions and generates nonsense. I assume the problem is that decoding never stops because the EOS token is missing. I don't understand (yet) how the training handles padding and EOS.
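One way to check this against the script above is to inspect the tokenizer's special tokens and make sure generation is configured to stop on EOS (a minimal sketch; whether the fine-tuned model ever actually emits EOS is the real question):

```python
# Early LLaMA tokenizer configs shipped with inconsistent special-token ids,
# which can leave generation with no usable stop token.
print(tokenizer.eos_token, tokenizer.eos_token_id, tokenizer.pad_token_id)

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    eos_token_id=tokenizer.eos_token_id,  # stop as soon as EOS is produced
    pad_token_id=tokenizer.eos_token_id,  # avoid the missing-PAD warning
)
```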