huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

apply_chat_template() with tokenize=False returns incorrect string #1389

Closed: Gnurro closed this issue 6 months ago

Gnurro commented 7 months ago

With the Llama 2 chat tokenizer, the string returned by apply_chat_template() with tokenize=False does not match the string you get by encoding with apply_chat_template() and then decoding the resulting ids.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                          token="hf_xxx", verbose=False)

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Lard!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Path 1: render the chat template directly to a string.
wrong_string = tokenizer.apply_chat_template(messages, tokenize=False)

# Path 2: render and tokenize, then decode the ids back to a string.
tokens = tokenizer.apply_chat_template(messages, return_tensors="pt")
decoded = tokenizer.batch_decode(tokens)[0]

# Expected True, but prints False.
print(wrong_string == decoded)
ArthurZucker commented 7 months ago

The assumption that a string survives an encode-then-decode round trip unchanged is not always right. Many things come into play, and specifically for fast tokenizers there is a known discrepancy around added tokens; see #26455. I would recommend comparing the ids, not the strings. 🤗
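
For reference, a minimal sketch of the suggested id comparison, assuming the tokenizer, messages, and tokens variables from the reproduction above are still in scope. It illustrates that the two apply_chat_template() calls agree on the token ids, so the mismatch is introduced only by the decode step (per #26455, decoding can render added tokens such as <s> differently from the raw template text):

# tokenize=True is the default, so this returns a plain list of token ids.
ids = tokenizer.apply_chat_template(messages)

# The ids from both calls match; only the string round trip diverges.
print(ids == tokens[0].tolist())  # True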

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.