code-isnot-cold opened 6 months ago
@ArthurZucker would passing a list of messages make the pipeline run them as a batch?
It doesn't seem to work. Reasons:
1) inference time is the same as running the prompts one at a time, and
2) the console warnings appear one by one, which suggests the prompts are processed sequentially rather than as a batch.
Here is my batch-inference code:
import torch
import transformers

def load_model():
    model_id = '/home/pengwj/programs/llama3/Meta-Llama-3-70B-Instruct'
    # tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    return pipeline

pipeline = load_model()

# batch_system_prompt = [[],[],[],[]] ; sections = [[],[],[],[]]
messages = [
    [{"role": "system", "content": system_prompt},
     {"role": "user", "content": user_prompt}]
    for system_prompt, user_prompt in zip(batch_system_prompt, sections)
]

# prompt = [[],[],[],[]]
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
cc @Rocketknight1
Hi @code-isnot-cold, great question! The short answer is that the text generation pipeline will only generate one sample at a time, so you won't gain any benefit from batching samples together. If you want to generate in a batch, you'll need to use the lower-level method model.generate() instead, and it's slightly more complex. However, you can definitely get performance benefits from it.
You'll need to tokenize with padding_side="left" and padding="longest", and you'll need to set a pad_token_id. The reason for this is that the sequences will have different lengths when you batch them together. Try this code snippet:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

input1 = [{"role": "user", "content": "Hi, how are you?"}]
input2 = [{"role": "user", "content": "How are you feeling today?"}]
texts = tokenizer.apply_chat_template([input1, input2], add_generation_prompt=True, tokenize=False)

tokenizer.pad_token_id = tokenizer.eos_token_id  # Set a padding token
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.to(model.device) for key, val in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=512)
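If you only want the assistant replies without the echoed prompts, one option is to slice off the prompt tokens before decoding. This is just a small sketch building on the snippet above, not part of the original answer:

# With left padding, every row of input_ids has the same (padded) length,
# so dropping that many tokens from each output row leaves only the reply.
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))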
Thank you for your detailed explanation @Rocketknight1. I have started using vLLM, which enables efficient inference, but I'll also try the model.generate() method for batch generation. Thanks again for your help @ArthurZucker
my pleasure! 🤗
I wrote my code based on @Rocketknight1's. I am a transformers beginner, so I hope there isn't any bug in my code. Code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model_id = "/path/to/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

myinput = [
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]

texts = tokenizer.apply_chat_template(myinput, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(texts, padding="longest", return_tensors="pt")
inputs = {key: val.cuda() for key, val in inputs.items()}
temp_texts = tokenizer.batch_decode(inputs["input_ids"], skip_special_tokens=True)

start_time = time.time()
gen_tokens = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
print(f"Time: {time.time()-start_time}")

gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
gen_text = [i[len(temp_texts[idx]):] for idx, i in enumerate(gen_text)]
print(gen_text)
Output:
Time: 2.219297409057617
['2', 'C++ is a powerful, compiled, object-oriented programming language.', 'George Washington, first president of the United States.', 'The capital of France is Paris.', 'Scattered sunlight by tiny molecules in atmosphere.', 'To find purpose, happiness, and fulfillment through experiences.', 'Immerse yourself in the language through listening and speaking.', "In your area's dormant season, typically late winter or early spring.", 'Poach it in simmering water for a perfect yolk.', 'There is no single "best" language, it depends on context.']
Would you please share your Llama 3 vLLM inference code? I've searched https://github.com/meta-llama/llama-recipes but failed to find a suitable script.
Sure, here is a reference: https://docs.vllm.ai/en/stable/getting_started/quickstart.html. I find that vLLM seems to be inferior to the transformers method for batch inference. Maybe there is something wrong with my code, so please share your results after trying it.
from vllm import SamplingParams, LLM
import time
model_id = "/path/to/Meta-Llama-3-70B-Instruct"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512,)
llm = LLM(model=model_id, tensor_parallel_size=4)
tokenizer = llm.get_tokenizer()
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eos_token_id
llm.set_tokenizer(tokenizer)
prompts = [
    "1 + 1 = ",
    "Introduce C++ in one short sentence less than 10 words.",
    "Who was the first president of the United States? Answer in less than 10 words.",
    "What is the capital of France ? Answer in less than 10 words.",
    "Why is the sky blue ? Answer in less than 10 words.",
    "What is the meaning of life? Answer in less than 10 words.",
    "What is the best way to learn a new language? Answer in less than 10 words.",
    "When is the best time to plant a tree? Answer in less than 10 words.",
    "What is the best way to cook an egg? Answer in less than 10 words.",
    "Which is the best programming language? Answer in less than 10 words."
]
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print(f"Time: {time.time() - start_time}")
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Here @code-isnot-cold, see https://github.com/vllm-project/vllm/issues/4180#issuecomment-2066004748 and https://github.com/vllm-project/vllm/issues/4180#issuecomment-2074017550
from vllm import SamplingParams, LLM
model_path = "/path/to/Meta-Llama-3-8B-Instruct"
model = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=1,
)
tokenizer = model.get_tokenizer()

myinput = [
    [{"role": "user", "content": "1 + 1 = "}],
    [{"role": "user", "content": "Introduce C++ in one short sentence less than 10 words."}],
    [{"role": "user", "content": "Who was the first president of the United States? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the capital of France ? Answer in less than 10 words."}],
    [{"role": "user", "content": "Why is the sky blue ? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the meaning of life? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to learn a new language? Answer in less than 10 words."}],
    [{"role": "user", "content": "When is the best time to plant a tree? Answer in less than 10 words."}],
    [{"role": "user", "content": "What is the best way to cook an egg? Answer in less than 10 words."}],
    [{"role": "user", "content": "Which is the best programming language? Answer in less than 10 words."}]
]
conversations = tokenizer.apply_chat_template(
    myinput,
    tokenize=False,
)

outputs = model.generate(
    conversations,
    SamplingParams(
        temperature=0.6,
        top_p=0.9,
        max_tokens=512,
        # KEYPOINT HERE: stop on <|eot_id|> so generation ends with the assistant's turn
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
    )
)

for output in outputs:
    # prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{generated_text!r}")
I read the issue and tried your code, which worked perfectly. Thank you for your contribution
I'm sorry, but I get an error when running the transformers batch-inference code above. Here is the traceback:
Traceback (most recent call last):
  File "batch_inference.py", line 26, in <module>
I haven't encountered this issue before. It seems there might be a problem with the Jinja library. My suggestion is to create a new environment with python>=3.10, reinstall the relevant libraries, and try again.
If you need both accuracy and speed for large-scale inference, you can try the vLLM library. If accuracy is not a requirement and speed is the priority, you can try Ollama.
Hi @xueziq, when I run your code, it works! That error is saying that one of your chats is malformed somehow, but in the code you pasted they're all correct.
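To illustrate what "malformed" means here (my own example, not taken from the traceback above): the chat template expects each conversation to be a list of dicts with "role" and "content" keys.

# Well-formed: a list of message dicts, each with "role" and "content".
good_chat = [{"role": "user", "content": "Hi"}]

# Malformed inputs the chat template cannot render, e.g. a missing
# "content" key or a bare string instead of a message dict.
bad_chat_1 = [{"role": "user"}]
bad_chat_2 = ["just a string"]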
I am a noob. Here is my code; how can I modify it to do batch inference?
import torch
import transformers

def load_model():
    model_id = 'llama3/Meta-Llama-3-70B-Instruct'
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    return tokenizer, pipeline

def get_response(pipeline, system_prompt, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
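Not part of your script, but here is a minimal sketch of one way to adapt it for batching, following @Rocketknight1's model.generate() advice earlier in this thread (the helper name and variables below are my own, illustrative only):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_batch_responses(model, tokenizer, system_prompts, user_prompts, max_new_tokens=512):
    # Build one chat per (system, user) pair.
    chats = [
        [{"role": "system", "content": s}, {"role": "user", "content": u}]
        for s, u in zip(system_prompts, user_prompts)
    ]
    texts = tokenizer.apply_chat_template(chats, add_generation_prompt=True, tokenize=False)
    # Left padding lines all prompts up at the right edge, as recommended above.
    inputs = tokenizer(texts, padding="longest", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
    # Keep only the newly generated tokens of each row before decoding.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

model_id = 'llama3/Meta-Llama-3-70B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
print(get_batch_responses(model, tokenizer, ["You are helpful."], ["Hi, how are you?"]))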