huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[LLaMA3] 'add_bos_token=True, add_eos_token=True' seems not taking effect #30947

Open kiva12138 opened 5 months ago

kiva12138 commented 5 months ago

System Info

Platform = Windows
PyTorch = 2.3.0
Transformers = 4.41.0

Who can help?

No response

Reproduction

import torch
from transformers import AutoTokenizer

LLaMAPath = '/path/to/llama3-8b'

# The following two calls yield the same result: both outputs contain the BOS token and no EOS token
tokenizer = AutoTokenizer.from_pretrained(LLaMAPath, add_bos_token=True, add_eos_token=True)
# tokenizer = AutoTokenizer.from_pretrained(LLaMAPath, add_bos_token=False, add_eos_token=False)

tokenizer.add_special_tokens({"pad_token": "<|reserved_special_token_0|>"}) 
inputs = tokenizer(['hi, how are you today?'], padding=True, return_tensors='pt')
print(inputs)

Both variants above produce the same input_ids: [128000, 6151, 11, 1268, 527, 499, 3432, 30]

Expected behavior

With tokenizer = AutoTokenizer.from_pretrained(LLaMAPath, add_bos_token=True, add_eos_token=True), I expected to get [128000, 6151, 11, 1268, 527, 499, 3432, 30, 128001].

With tokenizer = AutoTokenizer.from_pretrained(LLaMAPath, add_bos_token=False, add_eos_token=False), I expected to get [6151, 11, 1268, 527, 499, 3432, 30].
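For illustration, the expected behavior can be written as a small post-processing step over a plain list of ids. This is only a sketch of what the two flags are expected to do, not how the library implements them: the helper name `with_special_tokens` and its parameters are hypothetical, and the ids 128000 and 128001 are taken from the outputs quoted in this report.

```python
from typing import List

BOS_ID = 128000  # BOS id seen in the reported output
EOS_ID = 128001  # EOS id from the expected output in this report


def with_special_tokens(
    ids: List[int],
    add_bos: bool = True,
    add_eos: bool = True,
    bos_id: int = BOS_ID,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Hypothetical helper: return ids with BOS/EOS present or absent as requested."""
    body = list(ids)
    # Strip any BOS/EOS already present so the result depends only on the flags.
    if body and body[0] == bos_id:
        body = body[1:]
    if body and body[-1] == eos_id:
        body = body[:-1]
    if add_bos:
        body = [bos_id] + body
    if add_eos:
        body = body + [eos_id]
    return body


reported = [128000, 6151, 11, 1268, 527, 499, 3432, 30]
both = with_special_tokens(reported, add_bos=True, add_eos=True)
neither = with_special_tokens(reported, add_bos=False, add_eos=False)
print(both)
print(neither)
```

Applied to `inputs["input_ids"]` converted to a plain list, this could serve as a stopgap for single, unpadded sequences. It does not update the attention mask or handle padded batches.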

eyloncaplan commented 5 months ago

I'm having the same issue. Neither of these changes the encodings:

tokenizer.add_bos_token = False
tokenizer.add_eos_token = True

amyeroberts commented 5 months ago

cc @ArthurZucker

ArthurZucker commented 5 months ago

Hey! This is related to #30607: the tokenizer for Llama3 is a PreTrainedTokenizerFast, not a LlamaTokenizer or a LlamaTokenizerFast. It might actually be good to support an easy way to add bos and eos, though. Currently you have to update the TemplateProcessor, which is fairly annoying (not beginner friendly).

That's something which should be handled on the tokenizers side.

eyloncaplan commented 5 months ago

@ArthurZucker I think it's called TemplateProcessing, not TemplateProcessor. For those wondering, this is how I used it to get the tokenizer to add the eos token:

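from tokenizers import processors

bos = "<|begin_of_text|>"
eos = "<|end_of_text|>"
# Replace the fast tokenizer's post-processor so that EOS is appended after encoding
tokenizer._tokenizer.post_processor = processors.Sequence(
    [
        processors.ByteLevel(trim_offsets=False),
        processors.TemplateProcessing(
            # single sequence: BOS, text, EOS
            single=f"{bos}:0 $A:0 {eos}:0",
            # sequence pair: BOS, first text, BOS, second text, EOS
            pair=f"{bos}:0 $A:0 {bos}:1 $B:1 {eos}:1",
            special_tokens=[
                (bos, tokenizer.bos_token_id),
                (eos, tokenizer.eos_token_id),
            ],
        ),
    ]
)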
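To see what TemplateProcessing does in isolation, here is a minimal sketch on a toy tokenizer that needs no model download. The vocabulary, the `<bos>`/`<eos>` token names, and their ids are made up for the example and have nothing to do with the Llama 3 vocabulary. It assumes the `tokenizers` package is installed.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Toy vocabulary, invented for this example only.
vocab = {"<bos>": 0, "<eos>": 1, "<unk>": 2, "hi": 3, "there": 4}
tok = Tokenizer(models.WordLevel(vocab=vocab, unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

# Without a post-processor, no special tokens are added.
plain_ids = tok.encode("hi there").ids

# The template wraps a single sequence ($A) in <bos> ... <eos>.
tok.post_processor = processors.TemplateProcessing(
    single="<bos> $A <eos>",
    pair="<bos> $A <eos> $B:1 <eos>:1",
    special_tokens=[("<bos>", 0), ("<eos>", 1)],
)
wrapped_ids = tok.encode("hi there").ids

# The template is skipped when special tokens are explicitly disabled.
raw_ids = tok.encode("hi there", add_special_tokens=False).ids

print(plain_ids, wrapped_ids, raw_ids)
```

The post-processor runs after the model has produced ids, so changing the template changes which special tokens surround the text without touching the vocabulary.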

Now I'm worried that the padding tokens won't get added properly, but that's a different issue...

ArthurZucker commented 5 months ago

The padding token is unrelated: it's added if you ask the tokenizer to pad the input! And yes, thanks for providing the snippet @eyloncaplan 😉

kddubey commented 1 week ago

In case anyone else is blocked by this issue, I copied code from #31316 into a function which patches the tokenizer to support dynamically setting add_bos_token and add_eos_token.

Running this script:

```python
from transformers import AutoTokenizer

model_id = "yujiepan/llama-3.1-tiny-random"
text = "a b"

print("Load plain tokenizer\n")
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(" Default:", tokenizer(text)["input_ids"])
tokenizer.add_eos_token = True
print(" Add EOS:", tokenizer(text)["input_ids"])

print("\nLoad and patch tokenizer\n")
tokenizer2 = AutoTokenizer.from_pretrained(model_id)
force_support(tokenizer2)  # the patch function mentioned above (not shown here)
tokenizer2.add_eos_token = True
print(" Add EOS:", tokenizer2(text)["input_ids"])
tokenizer2.add_eos_token = False
print("Don't add:", tokenizer2(text)["input_ids"])
```

prints:

Load plain tokenizer

   Default: [128000, 64, 293]
   Add EOS: [128000, 64, 293]

Load and patch tokenizer

  Add EOS: [128000, 64, 293, 128009]
Don't add: [128000, 64, 293]