Llama3+Unsloth+PEFT with batched inference, and apply_chat_template results in infinite eot_id

devzzzero commented 1 month ago

So this is a strange one. I am stumped.

In way, this is sort of like #416, but I confirmed that if Batch==1, then the problem does not occur. (See below)

My inference loop looks like this

from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel
import torch
import os.path as path
import json

max_seq_length = 8192 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False
mergedPath = ....
model, tokenizer = FastLanguageModel.from_pretrained(model_name = mergedPath,
                                                     max_seq_length = max_seq_length,
                                                     dtype=dtype,
                                                     load_in_4bit=load_in_4bit)
tokenizer = get_chat_template(
  tokenizer,
  chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
  mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

def formatting_prompts_func(examples):
  convos = examples["conversations"]
  texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
  return { "text" : texts, }

Now, I load up DS1, which is just an array of json objects that look like

[ {"conversations":[{"from":"human","value": "QUERY" },{"from":"gpt","value":"RESPONSE"} ] ... } ...]

Basically, DS1 is my analog of dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")

Queries = [[{ "from": "human", "value" :m['conversations'][0]['value'] }] for m in  list(DS1) ]
Answers = ...

now here is the loop

FastLanguageModel.for_inference(model)
Results = []
Batch = 1 ## <<--- TROUBLE HERE!!
DS1I = tqdm(range(0,len(DS1), Batch), desc=f"conversation")
writer = DS1I.write
for i in DS1I:
L = min(i+Batch, len(DS1))
Q = Queries[i:L]
A = Answers[i:L]
T = Texts[i:L]
inputs = tokenizer.apply_chat_template(
  Q,
  tokenize = True,
  padding = True,
  add_generation_prompt = True, # Must add for generation
  return_tensors = "pt",
).to("cuda")
# print(inputs)
Res = model.generate(input_ids = inputs, max_new_tokens = 4096, use_cache = True, pad_token_id=tokenizer.eos_token_id) #  streamer = text_streamer,
writer(f'{len(Q)=} {getShapeOf([Q])=} {Res.shape=}')
Ans = tokenizer.batch_decode(Res.to('cpu'))
Results.append((i, Ans, T))
if i > 8:
break

If I set Batch to a value greater than 1, the model barfs and its output is padded to `max_new_tokens` with `'<|eot_id|>'`

which is tokenizer.eos_token_id (128009)

Not only that, the inference takes about 3-5x as long as a Batch=1 inference and the gpu is definitely being taxed.

i.e. Batch = 2 takes about 3-4x LONGER than Batch=1 inference

In way, this is sort of like #416, but I confirmed that if Batch==1, then the problem does not occur. The base model that I used for PEFT is "unsloth/llama-3-8b-Instruct"

So the possibilities are:

It's something that I did wrong
It's an unsloth issue a. something went wrong in apply_chat_template ??
It's an transformers.AutoModel issue. I am unable to verify because I am unable to load the model without unsloth.
It's a PEFT issue.

Profuse apologies in advance in case it's (1)

Relevant versions:

unsloth==2024.5
transformers==4.41.2

danielhanchen commented 1 month ago

Oh apologies batched inference is currently broken sorry

binhmed2lab commented 1 month ago

There are two problems in your code. First, the llama-3 chat template itself introduces eos_token at the end of every system/user/assistant prompt, so initialize pad_token = eos_token is a bad idea (make sure you don't make this mistake when training). Second, you don't input attention_mask to mask the logits of padding tokens. In this case, padding tokens are eos_token which will be put at the start of sentences for batching, so their logits will mess up the computation.

Solution:

tokenizer = get_chat_template(
      tokenizer,
      chat_template = "llama-3"
  )
tokenizer.pad_token = '<|reserved_special_token_250|>'
tokenizer.pad_token_id = 128255

inputs = tokenizer.apply_chat_template(
      dialogs,
      tokenize = True,
      padding = True,
      truncation = False,
      add_generation_prompt = True, # Must add for generation
      return_tensors = "pt",
).to("cuda")
attention_mask = (inputs != tokenizer.pad_token_id).int()
# in case eos_token = pad_token, the mask will be like: 1,1,1,0,1,1,1,0,1,1...

outputs = model.generate(input_ids = inputs, attention_mask=attention_mask, max_new_tokens = 768, use_cache = True)
gen_idx = len(inputs[0])
rs = tokenizer.batch_decode(outputs[:, gen_idx:], skip_special_tokens = True)

devzzzero commented 1 month ago

There are two problems in your code. First, the llama-3 chat template itself introduces eos_token at the end of every system/user/assistant prompt, so initialize pad_token = eos_token is a bad idea (make sure you don't make this mistake when training). Second, you don't input attention_mask to mask the logits of padding tokens. In this case, padding tokens are eos_token which will be put at the start of sentences for batching, so their logits will mess up the computation.

Solution:
tokenizer = get_chat_template(
      tokenizer,
      chat_template = "llama-3"
  )
tokenizer.pad_token = '<|reserved_special_token_250|>'
tokenizer.pad_token_id = 128255

inputs = tokenizer.apply_chat_template(
      dialogs,
      tokenize = True,
      padding = True,
      truncation = False,
      add_generation_prompt = True, # Must add for generation
      return_tensors = "pt",
).to("cuda")
attention_mask = (inputs != tokenizer.pad_token_id).int()
# in case eos_token = pad_token, the mask will be like: 1,1,1,0,1,1,1,0,1,1...

outputs = model.generate(input_ids = inputs, attention_mask=attention_mask, max_new_tokens = 768, use_cache = True)
gen_idx = len(inputs[0])
rs = tokenizer.batch_decode(outputs[:, gen_idx:], skip_special_tokens = True)

Okay! I will try it. Thank you very much!

danielhanchen commented 1 month ago

I'll take a look at this on my end as well!

ollayf commented 1 month ago

Hi. I was also looking for a way to speed up my inference using batches but also ran into the same problem. It seems like this fix does not work as I also get nonsense outputs. using a finetuned llama 3.1 8B model.

Here is my code where I combined both the OP's code and the suggested solution.

results = []

tokenizer.pad_token = '<|reserved_special_token_250|>'
tokenizer.pad_token_id = 128255

batch_size = 4 ## <<--- TROUBLE HERE!!
convo_counter = tqdm(range(0,len(convos), batch_size), desc=f"conversation")
writer = convo_counter.write
for i in convo_counter:
    end_idx = min(i+batch_size, len(convo_counter))
    convos_batch = convos[i:end_idx]
    answer_batch = answers[i:end_idx]
    inputs = tokenizer.apply_chat_template(
        convos_batch,
        tokenize = True,
        padding = True,
        truncation=False,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")
    attention_mask = (inputs != tokenizer.pad_token_id).int()
    # print(inputs)
    res = model.generate(input_ids = inputs, attention_mask=attention_mask, max_new_tokens = 128, use_cache = True, pad_token_id=tokenizer.eos_token_id) #  streamer = text_streamer,
    writer(f'{len(convos_batch)=} {res.shape=}')
    gen_idx = len(inputs[0])
    predictions = tokenizer.batch_decode(res[:, gen_idx:], skip_special_tokens = True)
    # predictions = tokenizer.batch_decode(res.to('cpu'))
    for x in range(end_idx-i):
        idx = x+i
        prediction = predictions[x]
        prediction = extract_response(prediction)
        results.append((idx, prediction, answer_batch[x]))

Here is my code for the tokenizer:

convo_template = \
    "{{ bos_token }}"\
    "{{ ''You are a helpful assistant to the user\n'' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '### User:\n' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '### Response:\n' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '### Response:\n' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"

tokenizer = get_chat_template(
    tokenizer,
    chat_template = (convo_template, unsloth_eos_token,), # You must provide a template and EOS token
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

When debugging I realise that it give totally nonsensical outputs when batch_size > 1. Here is an example of predictions at the first index: The fact that AI is this new meta tool that humanity has to enable us, along with all of our people, institutions

and second index: ClaimsClaimsClaims attesté que les étudiants regrettent de ne pas avoir plus réfléchi à ce qu'ils voulaient faire et de ne pas avoir plus pris de risques.<|end_of_text|>souvent, les étudiants regrettent de ne pas avoir plus réfléchi à ce qu'ils voulaient faire et de ne pas avoir plus pris de risques.<|end_of_text|>souvent, les étudiants regrettent de ne pas avoir plus réfléchi à ce qu'ils voulaient faire et de ne pas avoir plus pris de risques.<|end_of_text|>souvent, les étudiants regrettent de ne pas avoir plus

Third and fourth are equally like this. It is in french but this is weird because my training and eval data is totally english and at batch_size 1 it gives at least some intelligible english output. When I translate into english it at least make some sense, but I am not sure why it is in french.

The response to the same conversation in batch_size 1: "They wish they had taken more risks and taken more shots. They wish they had started working on their projects sooner.<|end_of_text|>### Response:\nThey wish they had taken more risks and taken more shots. They wish they had started working on their projects sooner.<|end_of_text|>### User:\nWhat is your perspective on the future of AI in education?\n### Response:\nI think this is reforming education in really positive ways. And I expect we'll see that for many other industries.<|end_of_text|>### Response:\nI think this is reforming education in really positive ways. And I expect we"

ollayf commented 1 month ago

Ok, I actually found the solution. It's related to another issue in this repo regarding tokenizer padding right by default. But the generations are slightly different still but still sensible. Let me know if this is intended:

Predictions with batch_size = 1

- The fact that AI is this new meta tool that humanity has to enable us, along with all of our people, institutions, whatever, to go to greater and greater heights. I think that's the real key.
- They wish they had taken more risks and taken more shots. They wish they had started working on their projects sooner.
- I think we're not that far away from a world where AI is a major influence in society.
- Young people should not be afraid of failure. Failure is a part of the process and can provide valuable learning experiences.

Predictions with batch_size = 4

- The fact that AI is this new meta tool that humanity has to enable us, along with all of our people, institutions, whatever, to go to greater and greater heights. I think that's the real key.
- They wish they had taken more risks and taken more time to explore the world and try to do what they want to do.
- I think we're not that far away from it.
- Young people should not be afraid to take risks and fail. The sooner you get to work on your goals, the sooner you will achieve them.

ollayf commented 1 month ago

Then there is also the question of why this happens.

Firstly, we have the attention mask which is supposed to inform the model to ignore the padded tokens. Then, also that during fine-tuning since the tokenizer automatically pads right, why does it only require pad right during inference time?

Just wanting to understand how it works behind the hood.

binhmed2lab commented 1 month ago

Nice question!

This occurs because the generator uses the last logits of the sequence to predict the next token. When you pad to the right of the sentence, the logits from the pad token are used to compute the next token. Since the logits of the pad token are apparently random, it is understandable that the generated sentence turns out to be nonsensical.

danielhanchen commented 1 month ago

Ye sorry so for finetuning, we do right padding, so [A, B, C, X, X, X] [D, E, X, X, X, X] But for inference, you have to call FastLanguageModel.for_inference(model) to force left padding. As @binhmed2lab said, the generation / decoding step uses the right most logits, so [A, B, C, X, X, X] [D, E, X, X, X, X] will use X and X which is wrong. Instead of left padding, we get [X, X, X, A, B, C] [X, X, X, X, D, E] which is now correct

unslothai / unsloth