Open sirluk opened 1 month ago
cc @Rocketknight1 our chat template expert
cc @yonigottesman who wrote the original PR at #30650, do you have any idea what the issue here could be? If you don't have time to investigate this right now, let me know and I'll take over.
related to this https://github.com/huggingface/transformers/pull/30650#issuecomment-2279201442
and #1620. If I get time I'll try to find a workaround for this tokenizer issue.
Got it - I wouldn't make a workaround in the template itself, because you'll need to remove the workaround again once the underlying tokenizers issue is fixed.
For anyone struggling with the same issue at the moment, I created a temporary workaround for my use case:
```python
import re

import torch
from transformers import AutoTokenizer


class TokenizerCodeFeedbackHacky:
    PROMPT = (
        "Instruction:\nGiven a multi-turn dialogue related to a coding task, your role is to generate the assistant's next response."
        "\n\nDialogue:\n"
    )
    CHAT_TEMPLATE_PATH = "chat_template.jinja"

    def __init__(self, tokenizer_path):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.tokenized_prompt = self.tokenizer(self.PROMPT, add_special_tokens=False, return_attention_mask=False)["input_ids"]
        self.tokenized_prompt_len = len(self.tokenized_prompt)
        with open(self.CHAT_TEMPLATE_PATH) as f:
            self.tokenizer.chat_template = f.read()
        self.chat_template_header = "### {role}:\n"

    def __call__(self, examples):
        chats = self.tokenizer.apply_chat_template(
            examples["messages"],
            add_generation_prompt=False,
            return_dict=False,
            tokenize=False,
        )
        chats_tokenized = self.tokenizer(
            chats,
            add_special_tokens=False,
            return_attention_mask=False,
            return_length=True,
            return_offsets_mapping=True,
        )
        assistant_mask = []
        for i in range(len(chats)):
            # start offset of every token in the rendered chat string
            s, _ = zip(*chats_tokenized[i].offsets)
            s = torch.tensor(s)
            # character positions right after each "### assistant:\n" header
            assistant_starts = [x.end() + 1 for x in re.finditer(self.chat_template_header.format(role="assistant"), chats[i])]
            # each assistant turn ends just before the next "### user:\n" header (the last one at the end of the chat)
            assistant_ends = [x.start() - 1 for x in re.finditer(self.chat_template_header.format(role="user"), chats[i])]
            assistant_ends = assistant_ends[1:] + [len(chats[i])]
            assistant_start_ids, assistant_end_ids = [], []
            for start, end in zip(assistant_starts, assistant_ends):
                # index of the last token whose start offset is <= the character position
                assistant_start_ids.append((s > start).long().argmax().item() - 1)
                assistant_end_ids.append((s > end).long().argmax().item() - 1)
            assistant_end_ids = assistant_end_ids[:-1] + [chats_tokenized["length"][i] - 1]
            mask = [0] * chats_tokenized["length"][i]
            for start_id, end_id in zip(assistant_start_ids, assistant_end_ids):
                mask[start_id:end_id] = [1] * (end_id - start_id)
            assistant_mask.append(mask)
        # prepend the fixed instruction prompt to every example
        input_ids = [self.tokenized_prompt + x for x in chats_tokenized["input_ids"]]
        assistant_mask = [[0] * self.tokenized_prompt_len + x for x in assistant_mask]
        input_length = [x + self.tokenized_prompt_len for x in chats_tokenized["length"]]
        return {"input_ids": input_ids, "assistant_mask": assistant_mask, "input_length": input_length}
```
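The `(s > start).long().argmax() - 1` expression in the workaround maps a character position to a token index: it finds the last token whose start offset is at or before that character. The same idea can be sketched without torch, using `bisect` over a (made-up, hypothetical) list of token start offsets:

```python
from bisect import bisect_right

def char_to_token_index(starts, char_pos):
    """Index of the last token whose start offset is <= char_pos.
    Same result as (s > char_pos).long().argmax() - 1 on a sorted
    tensor of start offsets, done here with bisect instead of torch."""
    return bisect_right(starts, char_pos) - 1

starts = [0, 4, 7, 11, 15]             # hypothetical token start offsets
print(char_to_token_index(starts, 8))  # character 8 falls inside token 2
```

This assumes the offsets are sorted and well-formed, which is exactly what the reported bug violates for the Llama 3 tokenizer.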
System Info
transformers version: 4.44.1

Who can help?
@ArthurZucker I noticed that apply_chat_template on the PreTrainedTokenizerBase class does not work correctly when return_assistant_tokens_mask=True. We would expect to get back, for each example, a mask where 1 indicates that the token is part of an assistant message and 0 otherwise. This works for the Llama 2 tokenizer, for example. I am sharing a minimal example to reproduce this issue.
Looking deeper into the apply_chat_template method, the issue seems to be related to the char_to_token method of the tokenizers.Encoding class, and could stem from the fact that the Llama 3 tokenizer was trained with tiktoken as opposed to sentencepiece.
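For context, char_to_token maps a character index to the token covering it via the token offsets. A rough pure-Python sketch of that behaviour (assuming half-open offsets; the offset values here are made up) also shows how degenerate offsets would make the assistant mask come out all zeros:

```python
def char_to_token(offsets, char_idx):
    """Rough reference behaviour of tokenizers.Encoding.char_to_token:
    return the index of the token whose half-open (start, end) offset
    range covers char_idx, or None if no token covers it."""
    for tok_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return tok_idx
    return None

offsets = [(0, 5), (5, 9), (10, 14)]   # made-up offsets with a gap at 9
print(char_to_token(offsets, 6))       # 1
print(char_to_token(offsets, 9))       # None - no token covers this char
print(char_to_token([(0, 0)] * 3, 2))  # None - degenerate offsets never match
```

If the tokenizer reports broken offsets for every token, every char_to_token lookup fails and no token is ever marked as part of an assistant message.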
Information
Tasks
Reproduction
Executing the steps used to build the assistant mask in the apply_chat_template method shows that the char_to_token method of the tokenizers.Encoding class does not work correctly.
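Conceptually, apply_chat_template marks every token whose offset range overlaps an assistant message's character span. A hypothetical simplification of that loop (not the actual transformers code) makes the failure mode visible when the offsets are degenerate:

```python
def assistant_mask_from_spans(offsets, spans):
    """Mark tokens whose (start, end) offsets overlap an assistant
    character span. Simplified sketch of the mask-building step."""
    mask = [0] * len(offsets)
    for span_start, span_end in spans:
        for tok, (s, e) in enumerate(offsets):
            if s < span_end and e > span_start:
                mask[tok] = 1
    return mask

good_offsets = [(0, 3), (3, 7), (7, 10)]
bad_offsets = [(0, 0)] * 3                # degenerate, as reported here
print(assistant_mask_from_spans(good_offsets, [(3, 7)]))  # [0, 1, 0]
print(assistant_mask_from_spans(bad_offsets, [(3, 7)]))   # [0, 0, 0]
```

With broken offsets, no token ever overlaps the assistant span, so the returned mask is all zeros regardless of the chat content.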
Expected behavior
If we assume the entire chat is 10 tokens long and the assistant tokens occur at positions 4-6 and 8-9, the expected output would be [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]. The actual output for the Llama 3 tokenizer is always all zeros: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
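The expected mask in the example above can be spelled out as a tiny helper (treating the positions as 1-indexed, inclusive spans, as in the example):

```python
def expected_assistant_mask(n_tokens, assistant_spans):
    """Build the expected mask: 1 for tokens inside an assistant span
    (1-indexed, inclusive, as in the example above), else 0."""
    mask = [0] * n_tokens
    for first, last in assistant_spans:
        for i in range(first - 1, last):
            mask[i] = 1
    return mask

print(expected_assistant_mask(10, [(4, 6), (8, 9)]))
# [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
```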