meta-llama / llama3

The official Meta Llama 3 GitHub site

How do models do batch inference when using the transformers method? #114

Open code-isnot-cold opened 6 months ago

code-isnot-cold commented 6 months ago

I am a noob. Here is my code; how can I modify it to do batch inference?


import torch
import transformers

def load_model():
    model_id = 'llama3/Meta-Llama-3-70B-Instruct'
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    return pipeline

def get_response(pipeline, system_prompt, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=4096,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    return outputs

HamidShojanazeri commented 6 months ago

@ArthurZucker would passing a list of messages work here?

code-isnot-cold commented 6 months ago

It doesn't seem to work.
Reasons: 1) the inference time is the same as running the prompts one at a time, and 2) the console warnings appear one by one, so it can be inferred that the model processes the prompts sequentially (see the attached screenshot).

Here is the code for batch inference:

def load_model():
    model_id = '/home/pengwj/programs/llama3/Meta-Llama-3-70B-Instruct'
    # tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    return pipeline

# batch_system_prompt = [[],[],[],[]] ; sections = [[],[],[],[]]
messages = [[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}] for system_prompt, user_prompt in zip(batch_system_prompt, sections)]

# prompt =  [[],[],[],[]]
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
ArthurZucker commented 5 months ago

cc @Rocketknight1

Rocketknight1 commented 5 months ago

Hi @code-isnot-cold, great question! The short answer is that the text generation pipeline will only generate one sample at a time, so you won't gain any benefit from batching samples together. If you want to generate in a batch, you'll need to use the lower-level method model.generate() instead, and it's slightly more complex. However, you can definitely get performance benefits from it.

You'll need to tokenize with padding_side="left", and padding="longest", and you'll need to set a pad_token_id. The reason for this is that the sequences will have different lengths when you batch them together. Try this code snippet:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

input1 = [{"role": "user", "content": "Hi, how are you?"}]
input2 = [{"role": "user", "content": "How are you feeling today?"}]
texts = tokenizer.apply_chat_template([input1, input2], add_generation_prompt=True, tokenize=False)

tokenizer.pad_token_id = tokenizer.eos_token_id  # Set a padding token
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.to(model.device) for key, val in inputs.items()}

model.generate(**inputs, max_new_tokens=512)
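
For reference, a minimal decoding sketch (not part of the original snippet; the variable names are mine): because the inputs are left-padded to a common length, you can slice off the prompt tokens before decoding.

# Sketch, assuming the snippet above: keep only the newly generated tokens.
gen_tokens = model.generate(**inputs, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
prompt_len = inputs["input_ids"].shape[1]  # every row is left-padded to this length
replies = tokenizer.batch_decode(gen_tokens[:, prompt_len:], skip_special_tokens=True)
print(replies)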
code-isnot-cold commented 5 months ago

Thank you for your detailed explanation @Rocketknight1 . I have started using the vllm method, which enables efficient inference. But I'll try to use the model.generate() method for batch generation. Thanks again for your help @ArthurZucker

ArthurZucker commented 5 months ago

my pleasure! 🤗

mirrorboat commented 5 months ago

I wrote my code based on @Rocketknight1 's. I am a transformers beginner and I hope that there isn't any bug in my code. Code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model_id = "/path/to/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side = "left")
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

myinput=[
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]
texts = tokenizer.apply_chat_template(myinput, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.cuda() for key, val in inputs.items()}
temp_texts=tokenizer.batch_decode(inputs["input_ids"], skip_special_tokens=True)

start_time = time.time()
gen_tokens = model.generate(
    **inputs, 
    max_new_tokens=512, 
    pad_token_id=tokenizer.eos_token_id, 
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
print(f"Time: {time.time()-start_time}")

gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
gen_text = [i[len(temp_texts[idx]):] for idx, i in enumerate(gen_text)]
print(gen_text)

Output:

Time: 2.219297409057617
['2', 'C++ is a powerful, compiled, object-oriented programming language.', 'George Washington, first president of the United States.', 'The capital of France is Paris.', 'Scattered sunlight by tiny molecules in atmosphere.', 'To find purpose, happiness, and fulfillment through experiences.', 'Immerse yourself in the language through listening and speaking.', "In your area's dormant season, typically late winter or early spring.", 'Poach it in simmering water for a perfect yolk.', 'There is no single "best" language, it depends on context.']
mirrorboat commented 5 months ago

> Thank you for your detailed explanation @Rocketknight1. I have started using the vllm method, which enables efficient inference. But I'll try to use the model.generate() method for batch generation. Thanks again for your help @ArthurZucker

Would you please share your llama3 vllm inference code? I've searched https://github.com/meta-llama/llama-recipes but failed to find a suitable script.

code-isnot-cold commented 5 months ago

Sure, here is a reference: https://docs.vllm.ai/en/stable/getting_started/quickstart.html. I find that vLLM seems to perform worse than the transformers approach for batch inference. Maybe there is something wrong with my code; please share what you find after trying it.

from vllm import SamplingParams, LLM
import time

model_id = "/path/to/Meta-Llama-3-70B-Instruct"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512,)
llm = LLM(model=model_id, tensor_parallel_size=4)

tokenizer = llm.get_tokenizer()
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eos_token_id
llm.set_tokenizer(tokenizer)

prompts = [
    "1 + 1 = ",
    "Introduce C++ in one short sentence less than 10 words.",
    "Who was the first president of the United States? Answer in less than 10 words.",
    "What is the capital of France ? Answer in less than 10 words.",
    "Why is the sky blue ? Answer in less than 10 words.",
    "What is the meaning of life? Answer in less than 10 words.",
    "What is the best way to learn a new language? Answer in less than 10 words.",
    "When is the best time to plant a tree? Answer in less than 10 words.",
    "What is the best way to cook an egg? Answer in less than 10 words.",
    "Which is the best programming language? Answer in less than 10 words."
]

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print(f"Time: {time.time() - start_time}")

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
mirrorboat commented 5 months ago

Here you go @code-isnot-cold: https://github.com/vllm-project/vllm/issues/4180#issuecomment-2066004748 and https://github.com/vllm-project/vllm/issues/4180#issuecomment-2074017550

from vllm import SamplingParams, LLM

model_path = "/path/to/Meta-Llama-3-8B-Instruct"

model = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=1,
)
tokenizer = model.get_tokenizer()

myinput=[
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]

conversations = tokenizer.apply_chat_template(
    myinput,
    tokenize=False,
)

outputs = model.generate(
    conversations,
    SamplingParams(
        temperature=0.6,
        top_p=0.9,
        max_tokens=512,
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],  # KEYPOINT HERE
    )
)

for output in outputs:
    # prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{generated_text!r}")
code-isnot-cold commented 5 months ago

I read the issue and tried your code, which worked perfectly. Thank you for your contribution

Xueziq commented 1 month ago

> [Quoting @mirrorboat's batched model.generate() code and output from above.]

Sorry, I'm running into a bug with this code; here is the traceback:

Traceback (most recent call last):
  File "batch_inference.py", line 26, in <module>
    texts = tokenizer.apply_chat_template(myinput, add_generation_prompt=True, tokenize=False)
  File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1743, in apply_chat_template
    rendered = compiled_template.render(
  File "/opt/conda/lib/python3.8/site-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/opt/conda/lib/python3.8/site-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "