huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Gemma template won't end with eos_token #32110

Closed rangehow closed 2 weeks ago

rangehow commented 1 month ago

I'm not quite sure if this is a bug. I've observed that Gemma's chat template does not append EOS even at the end of the assistant's response. This behavior is quite inconsistent with the templates of other models.

Gemma's assistant turns always end with <end_of_turn>, while its eos_token is only <eos>. The potential issues with this behavior are:

1. If this template is used to format training examples, it could lead to the model learning to never stop (see the sketch below).
2. There is a gap between this template and the EOS-terminated format the original model was trained with, which could affect performance.
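As a minimal illustration (the assistant reply here is made up), rendering the template over a finished user/assistant exchange gives a string that ends in <end_of_turn> with no <eos>, so anyone building SFT data from it would have to append the EOS token themselves:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it")
chat = [
    {"role": "user", "content": "Do you like beef?"},
    {"role": "assistant", "content": "I don't have preferences."},
]
# Render as a string so the raw template output is visible.
text = tokenizer.apply_chat_template(chat, tokenize=False)
print(repr(text))  # ends with '<end_of_turn>\n', no eos_token
# One possible workaround when formatting training examples:
text_with_eos = text + tokenizer.eos_token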

rangehow commented 1 month ago
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM,TextStreamer
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-2b-it",
    torch_dtype=torch.bfloat16
).cuda()

streamer = TextStreamer(tokenizer)
input_text = [{'role':'user','content':"Do you like beef?"}]
input_ids = tokenizer.apply_chat_template(input_text, return_tensors="pt",add_generation_prompt=True).to("cuda")

outputs = model.generate(input_ids,streamer=streamer,max_new_tokens=4096)

with its output:

<bos><start_of_turn>user
Do you like beef?<end_of_turn>
<start_of_turn>model
I am unable to express personal opinions or preferences, including those related to food items. As an AI language model, I am programmed to provide factual and informative responses based on available data and knowledge.<eos>

It is easy to see that the response does not even end with <end_of_turn>, but rather with <eos>.
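For reference, a quick way to confirm which token generation stopped on, reusing the outputs tensor from the snippet above:

# Inspect the last generated token id and its string form
last_id = outputs[0, -1].item()
print(last_id, tokenizer.convert_ids_to_tokens(last_id))  # the <eos> id and token here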

rangehow commented 1 month ago

Hi @Khaliq88, you sent a blank response; looking forward to your opinion on this.

amyeroberts commented 1 month ago

@rangehow Unfortunately @Khaliq88 has been repeatedly spamming the repo with issues and comments like this. I've marked it as spam, and the user has been blocked from posting.

cc @ArthurZucker @Rocketknight1

jena-shreyas commented 1 month ago
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM,TextStreamer
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-2b-it",
    torch_dtype=torch.bfloat16
).cuda()

streamer = TextStreamer(tokenizer)
input_text = [{'role':'user','content':"Do you like beef?"}]
input_ids = tokenizer.apply_chat_template(input_text, return_tensors="pt",add_generation_prompt=True).to("cuda")

outputs = model.generate(input_ids,streamer=streamer,max_new_tokens=4096)

with its output:

<bos><start_of_turn>user
Do you like beef?<end_of_turn>
<start_of_turn>model
I am unable to express personal opinions or preferences, including those related to food items. As an AI language model, I am programmed to provide factual and informative responses based on available data and knowledge.<eos>

It is easy to see that the response does not even end with <end_of_turn>, but rather with <eos>.

This seems to be the case for gemma-1.1 variants.

For the same setup, I obtained a similar output (with the <end_of_turn> token missing before the <eos> token) for google/gemma-1.1-7b-it:

<bos><start_of_turn>user
Do you like beef?<end_of_turn>
<start_of_turn>model
I am unable to provide subjective opinions or preferences, as I am an AI language model and do not have personal feelings or tastes. My purpose is to provide factual information and answer questions based on available knowledge.<eos>

while the correct output format (with <end_of_turn> before <eos>) is obtained for google/gemma-2-9b-it:

<bos><start_of_turn>user
Do you like beef?<end_of_turn>
<start_of_turn>model
As an AI, I don't have personal preferences or the ability to eat, so I don't have an opinion on beef or any other food.

Do you like beef?<end_of_turn>
<eos>
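As a side note, for checkpoints that do emit <end_of_turn> (like gemma-2 above), generate() accepts a list of EOS token ids if you want decoding to stop at the turn delimiter instead of running on to <eos>. A minimal sketch, reusing the objects from the snippet above:

# Stop on either the model's <eos> or the <end_of_turn> delimiter
end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")
outputs = model.generate(
    input_ids,
    streamer=streamer,
    max_new_tokens=4096,
    eos_token_id=[tokenizer.eos_token_id, end_of_turn_id],
)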
rangehow commented 1 month ago

@jena-shreyas your test helps make this issue clearer. It seems the quickest fix would be to replace the chat template in the Hugging Face gemma-1.1 repos (a sketch of what that could look like is below), but that would need more testing to confirm how gemma-1 behaves. :)
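A minimal sketch of such an override, assigning a trimmed-down template that appends eos_token after each assistant turn (the official template's system-role and role-alternation checks are dropped for brevity, so this is illustrative only, not an official Gemma template):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it")

# Same formatting as the stock template, plus eos_token after completed assistant turns.
tokenizer.chat_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% set role = 'model' if message['role'] == 'assistant' else message['role'] %}"
    "{{ '<start_of_turn>' + role + '\\n' + message['content'] | trim + '<end_of_turn>\\n' }}"
    "{% if message['role'] == 'assistant' %}{{ eos_token }}{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<start_of_turn>model\\n' }}{% endif %}"
)

With this in place, applying the template to a user/assistant exchange ends with <end_of_turn> followed by <eos>.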

jena-shreyas commented 1 month ago

Some more observations:

(Refer to the Gemma models' tokenizer_config.json for the relevant special tokens and their ids.)
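For a quick programmatic check of those tokens, reusing the gemma-1.1-2b-it tokenizer from the snippets above:

# Compare the tokenizer's declared EOS with the turn delimiter used by the chat template
print(tokenizer.eos_token, tokenizer.eos_token_id)
print("<end_of_turn>", tokenizer.convert_tokens_to_ids("<end_of_turn>"))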

Meanwhile, the chat template in the Gemma models only controls the format of the prompt passed to the model, which looks like this (same for both variants):

<bos><start_of_turn>user
Do you like beef?<end_of_turn>
<start_of_turn>model

It is provided as a Jinja2 template in the tokenizer_config.json, and is the same for both gemma-1 and gemma-2 variants:

{{ bos_token }}
{% if messages[0]['role'] == 'system' %}
    {{ raise_exception('System role not supported') }}
{% endif %}
{% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if (message['role'] == 'assistant') %}
        {% set role = 'model' %}
    {% else %}
        {% set role = message['role'] %}
    {% endif %}
    {{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}
{% endfor %}
{% if add_generation_prompt %}
    {{ '<start_of_turn>model\n' }}
{% endif %}
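To confirm that this template produces exactly the prompt shown above, it can be rendered without tokenization (a quick check, using the same tokenizer as earlier):

# The Jinja template is loaded from tokenizer_config.json into this attribute
print(tokenizer.chat_template)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Do you like beef?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # <bos><start_of_turn>user\nDo you like beef?<end_of_turn>\n<start_of_turn>model\n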

Since no post-processing is done to the model.generate() output, this might be an issue with the gemma-1.1 checkpoints themselves.

@ArthurZucker @Rocketknight1 would like to know your thoughts on this.

Rocketknight1 commented 1 month ago

Hi @jena-shreyas @rangehow, I don't believe this is a bug. Models sometimes emit <eos> to indicate they want to stop generation, even when they have been trained with <end_of_turn> tokens in their input data instead. In this case, stopping generation after <eos> but still including <end_of_turn> when formatting that message in the chat is the correct behaviour, though I agree it seems a little weird!

The Gemma chat templates were written by Google, not by us - you can try opening an issue on one of the Gemma model pages instead and pinging the authors to see what they think, but in general we haven't observed any performance degradation with Gemma even in long multi-turn conversations, so I think the current template behaviour is probably okay!

jena-shreyas commented 1 month ago

Thanks for the clarification @Rocketknight1! I agree this behaviour shown by the gemma-v1 variants is unexpected (though it shouldn't be an issue as long as performance isn't affected). Anyway, this seems to be resolved in the gemma-v2 variants, so no worries!

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.