Open devzzzero opened 1 month ago
Oh, apologies, batched inference is currently broken, sorry.
There are two problems in your code. First, the llama-3 chat template itself adds eos_token at the end of every system/user/assistant turn, so setting pad_token = eos_token is a bad idea (make sure you don't make this mistake when training either). Second, you don't pass an attention_mask to mask out the padding tokens. In this case, the padding tokens are eos_token, which is placed at the start of the shorter sequences for batching, so their logits will corrupt the computation.
Solution:
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
)
tokenizer.pad_token = '<|reserved_special_token_250|>'
tokenizer.pad_token_id = 128255
inputs = tokenizer.apply_chat_template(
    dialogs,
    tokenize = True,
    padding = True,
    truncation = False,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")
attention_mask = (inputs != tokenizer.pad_token_id).int()
# in case eos_token = pad_token, the mask will be like: 1,1,1,0,1,1,1,0,1,1...
outputs = model.generate(input_ids = inputs, attention_mask = attention_mask, max_new_tokens = 768, use_cache = True)
gen_idx = len(inputs[0])
rs = tokenizer.batch_decode(outputs[:, gen_idx:], skip_special_tokens = True)
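To make the comment about the mask concrete, here is a minimal sketch in plain Python, with lists standing in for the tensors. The ids 128009 and 128255 are the ones quoted in this thread; the other token ids and the helper names are made up for illustration.

```python
# Illustration only: plain lists stand in for the tokenizer's output tensors.
EOS_ID = 128009   # <|eot_id|>, as quoted in this thread
PAD_ID = 128255   # <|reserved_special_token_250|>, as used in the snippet above

def mask_from_pad(ids, pad_id):
    """Return 1 for real tokens and 0 for positions equal to pad_id."""
    return [int(t != pad_id) for t in ids]

def make_row(pad_id):
    """A left-padded row: two pad slots, then two turns that each end with EOS.
    The ids 11, 12, 21, 22 are made-up content tokens."""
    return [pad_id, pad_id, 11, 12, EOS_ID, 21, 22, EOS_ID]

good = mask_from_pad(make_row(PAD_ID), PAD_ID)  # dedicated pad token
bad = mask_from_pad(make_row(EOS_ID), EOS_ID)   # pad token reused from EOS

print(good)  # [0, 0, 1, 1, 1, 1, 1, 1]
print(bad)   # [0, 0, 1, 1, 0, 1, 1, 0]
```

With a dedicated pad token, only the two padding slots are masked. When the pad token is the EOS token, the end-of-turn markers inside the conversation are masked as well.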
Okay! I will try it. Thank you very much!
I'll take a look at this on my end as well!
Hi. I was also looking for a way to speed up my inference using batches and ran into the same problem. This fix does not seem to work for me: I still get nonsense outputs. I am using a finetuned Llama 3.1 8B model.
Here is my code where I combined both the OP's code and the suggested solution.
results = []
tokenizer.pad_token = '<|reserved_special_token_250|>'
tokenizer.pad_token_id = 128255
batch_size = 4 ## <<--- TROUBLE HERE!!
convo_counter = tqdm(range(0, len(convos), batch_size), desc="conversation")
writer = convo_counter.write
for i in convo_counter:
    # Clamp to the number of conversations, not the number of batches
    end_idx = min(i + batch_size, len(convos))
    convos_batch = convos[i:end_idx]
    answer_batch = answers[i:end_idx]
    inputs = tokenizer.apply_chat_template(
        convos_batch,
        tokenize = True,
        padding = True,
        truncation = False,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")
    attention_mask = (inputs != tokenizer.pad_token_id).int()
    # print(inputs)
    res = model.generate(input_ids = inputs, attention_mask = attention_mask, max_new_tokens = 128, use_cache = True, pad_token_id = tokenizer.eos_token_id) # streamer = text_streamer,
    writer(f'{len(convos_batch)=} {res.shape=}')
    gen_idx = len(inputs[0])
    predictions = tokenizer.batch_decode(res[:, gen_idx:], skip_special_tokens = True)
    # predictions = tokenizer.batch_decode(res.to('cpu'))
    for x in range(end_idx - i):
        idx = x + i
        prediction = predictions[x]
        prediction = extract_response(prediction)
        results.append((idx, prediction, answer_batch[x]))
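One detail worth checking when slicing batches by hand: the upper bound should be clamped to the number of items, not to the number of loop steps. The sketch below uses a plain range in place of the progress bar and a hypothetical helper name, batch_bounds.

```python
def batch_bounds(n_items, batch_size):
    """Return (start, end) index pairs that cover range(n_items) in chunks."""
    return [(i, min(i + batch_size, n_items))
            for i in range(0, n_items, batch_size)]

n_convos, batch_size = 10, 4
steps = range(0, n_convos, batch_size)

print(len(steps))                          # 3 loop steps for 10 items
print(batch_bounds(n_convos, batch_size))  # [(0, 4), (4, 8), (8, 10)]

# Clamping to the number of steps instead silently shrinks or empties batches.
wrong = [(i, min(i + batch_size, len(steps))) for i in steps]
print(wrong)                               # [(0, 3), (4, 3), (8, 3)]
```

A slice such as convos[4:3] is simply empty in Python, so this kind of mistake does not raise an error.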
Here is my code for the tokenizer:
convo_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '### User:\n' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '### Response:\n' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '### Response:\n' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"
tokenizer = get_chat_template(
    tokenizer,
    chat_template = (convo_template, unsloth_eos_token,), # You must provide a template and EOS token
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)
When debugging, I realised that it gives totally nonsensical outputs when batch_size > 1. Here is an example of the prediction at the first index: The fact that AI is this new meta tool that humanity has to enable us, along with all of our people, institutions
and second index: ClaimsClaimsClaims attesté que les étudiants regrettent de ne pas avoir plus réfléchi à ce qu'ils voulaient faire et de ne pas avoir plus pris de risques.<|end_of_text|>souvent, les étudiants regrettent de ne pas avoir plus réfléchi à ce qu'ils voulaient faire et de ne pas avoir plus pris de risques.<|end_of_text|>souvent, les étudiants regrettent de ne pas avoir plus réfléchi à ce qu'ils voulaient faire et de ne pas avoir plus pris de risques.<|end_of_text|>souvent, les étudiants regrettent de ne pas avoir plus
The third and fourth are similar. The output is in French, which is odd because my training and eval data are entirely in English, and at batch_size 1 it gives at least some intelligible English output. Translated into English, the French text makes some sense, but I am not sure why it is in French.
The response to the same conversation in batch_size 1: "They wish they had taken more risks and taken more shots. They wish they had started working on their projects sooner.<|end_of_text|>### Response:\nThey wish they had taken more risks and taken more shots. They wish they had started working on their projects sooner.<|end_of_text|>### User:\nWhat is your perspective on the future of AI in education?\n### Response:\nI think this is reforming education in really positive ways. And I expect we'll see that for many other industries.<|end_of_text|>### Response:\nI think this is reforming education in really positive ways. And I expect we"
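The decoded text above keeps going after the first <|end_of_text|> marker. A minimal sketch of trimming it is below; cut_at_eos is a hypothetical helper, not the extract_response function used in the loop, whose body is not shown in this thread.

```python
def cut_at_eos(text, eos_marker="<|end_of_text|>"):
    """Keep only the text before the first end-of-sequence marker."""
    return text.split(eos_marker, 1)[0].strip()

raw = (
    "They wish they had started working on their projects sooner."
    "<|end_of_text|>### Response:\nThey wish they had started"
)
print(cut_at_eos(raw))
```

If the marker never appears, the text is returned unchanged apart from surrounding whitespace.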
OK, I actually found the solution. It's related to another issue in this repo about the tokenizer padding on the right by default. The generations still differ slightly from the batch_size 1 results, but they are sensible. Let me know if this is intended:
Predictions with batch_size = 1
- The fact that AI is this new meta tool that humanity has to enable us, along with all of our people, institutions, whatever, to go to greater and greater heights. I think that's the real key.
- They wish they had taken more risks and taken more shots. They wish they had started working on their projects sooner.
- I think we're not that far away from a world where AI is a major influence in society.
- Young people should not be afraid of failure. Failure is a part of the process and can provide valuable learning experiences.
Predictions with batch_size = 4
- The fact that AI is this new meta tool that humanity has to enable us, along with all of our people, institutions, whatever, to go to greater and greater heights. I think that's the real key.
- They wish they had taken more risks and taken more time to explore the world and try to do what they want to do.
- I think we're not that far away from it.
- Young people should not be afraid to take risks and fail. The sooner you get to work on your goals, the sooner you will achieve them.
Then there is also the question of why this happens.
First, we have the attention mask, which is supposed to tell the model to ignore the padded tokens. Second, since the tokenizer automatically pads on the right during fine-tuning, why is left padding only required at inference time?
I just want to understand how it works under the hood.
Nice question!
This occurs because the generator uses the logits at the last position of the sequence to predict the next token. When you pad on the right, the logits at a pad position are used to compute the next token. Since the logits at a pad position are essentially arbitrary, it is understandable that the generated sentence turns out to be nonsensical.
Yes, sorry. For finetuning, we do right padding, so
[A, B, C, X, X, X]
[D, E, X, X, X, X]
But for inference, you have to call FastLanguageModel.for_inference(model) to force left padding. As @binhmed2lab said, the generation / decoding step uses the rightmost logits, so
[A, B, C, X, X, X]
[D, E, X, X, X, X]
will use X and X, which is wrong.
Instead, with left padding, we get
[X, X, X, A, B, C]
[X, X, X, X, D, E]
which is now correct.
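The two layouts can be reproduced with a small pure-Python sketch. pad_batch is a hypothetical helper written for illustration. It is not how the tokenizer pads internally, but it shows which token ends up in the last position, the one the decoding step reads from.

```python
PAD = "X"

def pad_batch(seqs, side):
    """Pad token lists to a common length; return (rows, attention_masks)."""
    width = max(len(s) for s in seqs)
    rows, masks = [], []
    for s in seqs:
        fill = [PAD] * (width - len(s))
        ones, zeros = [1] * len(s), [0] * len(fill)
        if side == "left":
            rows.append(fill + list(s))
            masks.append(zeros + ones)
        else:
            rows.append(list(s) + fill)
            masks.append(ones + zeros)
    return rows, masks

batch = [["A", "B", "C"], ["D", "E"]]
for side in ("right", "left"):
    rows, masks = pad_batch(batch, side)
    last_tokens = [row[-1] for row in rows]  # position used to pick the next token
    print(side, rows, masks, last_tokens)
```

With right padding the second row ends in X, so the next token is predicted from a pad position. With left padding every row ends in a real token. On a Hugging Face tokenizer the side is controlled by the padding_side attribute.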
So this is a strange one. I am stumped.
In a way, this is sort of like #416, but I confirmed that if Batch==1, the problem does not occur. (See below)
My inference loop looks like this
Now, I load up DS1, which is just an array of json objects that look like
[ {"conversations":[{"from":"human","value": "QUERY" },{"from":"gpt","value":"RESPONSE"} ] ... } ...]
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
now here is the loop
If I set Batch to a value greater than 1, the model barfs and its output is padded to max_new_tokens with '<|eot_id|>', which is tokenizer.eos_token_id (128009).
Not only that, the inference takes about 3-5x as long as a Batch=1 inference, and the GPU is definitely being taxed.
The base model that I used for PEFT is "unsloth/llama-3-8b-Instruct".
So the possibilities are:
apply_chat_template
??Profuse apologies in advance in case it's (1)
Relevant versions: