huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LLaMA FastTokenizer does not add `eos_token_id` at the end. #22794

Closed osainz59 closed 1 year ago

osainz59 commented 1 year ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

As mentioned in the title, the LLaMA tokenizer does not add the eos_token at the end of the inputs. This only happens with the fast version (use_fast=True).

Steps to reproduce the behaviour:

  1. Load the LLaMA tokenizer
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)
  2. Tokenize something
    simple_sentence = "This is a sentence to test if the tokenizer adds eos token."
    simple_sentence_ids = tokenizer(
        simple_sentence, add_special_tokens=True
    ).input_ids
  3. Print the input_ids to check if the eos_token_id (2) is added at the end.
    print(simple_sentence_ids)
  4. Output:
    [1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]

Expected behavior

Expected output

[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889, 2]
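
A quick way to confirm that only the fast tokenizer is affected (a minimal sketch, assuming the same LLAMA_PATH placeholder as in the steps above):

from transformers import AutoTokenizer

sentence = "This is a sentence to test if the tokenizer adds eos token."

# LLAMA_PATH points at the same checkpoint used in the reproduction steps.
slow = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=False)
fast = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)

print(slow(sentence).input_ids[-1])  # 2 (eos_token_id) with the slow tokenizer
print(fast(sentence).input_ids[-1])  # 29889 here: the fast tokenizer drops the eos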
Burning-XX commented 10 months ago

I also met this problem with LlamaTokenizer. What should I do to make it work? self.tokenizer.eos_token shows nothing.

Burning-XX commented 10 months ago

I also met this problem with LlamaTokenizer. What should I do to make it work? self.tokenizer.eos_token shows nothing.

transformers.__version__ = 4.33.1

ArthurZucker commented 10 months ago

pip install -U transformers 🤗
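
After upgrading, a short sanity check along the lines of the reproduction above (a sketch; LLAMA_PATH is again a placeholder for whatever checkpoint you load):

from transformers import AutoTokenizer

# LLAMA_PATH is a placeholder for your LLaMA checkpoint.
tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)
ids = tokenizer("This is a sentence to test if the tokenizer adds eos token.").input_ids

# On a fixed version the sequence should end with the eos_token_id (2 for LLaMA).
assert ids[-1] == tokenizer.eos_token_id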

surak commented 9 months ago

The latest version of transformers fails here, but I got it running with transformers==4.36.0

KawaiiNotHawaii commented 9 months ago

The Llama 2 fast tokenizer runs into the same issue here. I haven't tried the slow version.

ltoniazzi commented 7 months ago

@jonathangomesselman In order to control the -100 masking of the labels, do you think it would work to modify DataCollatorForLanguageModeling so that it only creates the labels when they are not already provided (as opposed to using DataCollatorForSeq2Seq)? For example:

from typing import Any, Dict, List, Union

from transformers import DataCollatorForLanguageModeling


class CustomDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    def torch_call(
        self, examples: List[Union[List[int], Any, Dict[str, Any]]]
    ) -> Dict[str, Any]:
        # Original Hugging Face code
        # ....
        if self.mlm:
            batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
                batch["input_ids"], special_tokens_mask=special_tokens_mask
            )
        elif "labels" not in batch.keys():  # <-- Only line changed
            labels = batch["input_ids"].clone()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            batch["labels"] = labels
        return batch
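
If that change behaves as intended, usage would stay close to the stock collator. A rough sketch, assuming `tokenizer`, `model`, `args`, and a `tokenized_dataset` whose examples already contain precomputed `labels` are defined elsewhere:

from transformers import Trainer

# mlm=False takes the causal-LM branch above, so labels already present in the
# examples are left untouched by the collator.
data_collator = CustomDataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
    args=args,
)
trainer.train()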
iliazlobin commented 7 months ago

I've had the same problem with gpt2: that tokenizer doesn't insert an eos token at the end of a sample (by design), in contrast to gemma.

Thanks to @jonathangomesselman, I put together a tokenize function with batch processing, coupled it with the Trainer, and wrote an inference snippet for a decoder-only transformer (trimming the input prompt off the generated tensors).

# Dataset structure
DatasetDict({
    train: Dataset({
        features: ['src', 'tgt', 'text'],
        num_rows: 1000
    })
})

# Tokenize function and data collator
def tokenize(samples):
    # model_input = tokenizer(samples["text"], max_length=256, padding="max_length", truncation=True)

    model_inputs = tokenizer(samples["text"])
    model_responses = tokenizer(samples["tgt"])

    new_input_ids = []
    new_labels = []
    for input_ids, response_ids in zip(model_inputs.input_ids, model_responses.input_ids):
        # debug_labels = input_ids[-len(response_ids) :]
        # print(tokenizer.decode(input_ids, skip_special_tokens=False))
        # print(tokenizer.decode(debug_labels, skip_special_tokens=False))

        input_ids.append(tokenizer.eos_token_id)
        new_input_ids.append(input_ids)

        num_tokens_ignore = len(input_ids) - len(response_ids)
        labels = [-100] * num_tokens_ignore + input_ids[-len(response_ids) :]
        new_labels.append(labels)

    new_attention_mask = []
    for attention_mask in model_inputs.attention_mask:
        attention_mask.append(1)
        new_attention_mask.append(attention_mask)

    model_inputs["input_ids"] = new_input_ids
    model_inputs["attention_mask"] = new_attention_mask
    model_inputs["labels"] = new_labels

    return model_inputs

processed_dataset = text_dataset.map(tokenize, batched=True, remove_columns=["tgt", "src", "text"])

from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq, Trainer

# Use the eos token as the padding token for this decoder-only model
tokenizer.pad_token = tokenizer.eos_token

data_collator = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True)
dataloader = DataLoader(processed_dataset["train"], batch_size=4, collate_fn=data_collator)
for batch in dataloader:
    print(batch)

# To use with the trainer
trainer = Trainer(
    model=model,
    train_dataset=processed_dataset["train"],
    args=args,
    data_collator=data_collator,
)

# Inference example
input_texts = [
    f"Instruction: <text1>. Response:",
    f"Instruction: <text2>. Response:",
    f"Instruction: <text3>. Response:",
]

inputs = tokenizer(input_texts, return_tensors="pt", padding=True).to(device)

trained_model.config.use_cache = False
outputs = trained_model.generate(
    **inputs,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
    max_length=256,
)

trimmed_output = outputs[:, inputs.input_ids.shape[1] :]
trimmed_outputs = tokenizer.batch_decode(trimmed_output, skip_special_tokens=True)
print(trimmed_outputs)

Adjust for your dataset, where "text" is the full sentence and "tgt" is the label (the part of that sentence that you want the model to predict).
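
For illustration, a hypothetical row (not from the original dataset) showing how the three columns relate:

# Hypothetical example row: "text" is the prompt plus the response, and "tgt"
# is the response alone, so the labels cover only the response tokens.
example = {
    "src": "Instruction: <text1>. Response:",
    "tgt": " <response1>",
    "text": "Instruction: <text1>. Response: <response1>",
}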

JohnGiorgi commented 6 months ago

Adding the eos_token at the end of each training example can be activated using

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token = True)

Or simply:

>>> tokenizer.add_eos_token = True

This is still not working for me using Llama-3-8b

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    add_eos_token=True,
    force_download=True, 
    token=True
)
tokenizer.tokenize("this is a test", add_special_tokens=True)
>>> ['<|begin_of_text|>', 'this', 'Ġis', 'Ġa', 'Ġtest']

I am on transformers==4.40.1 and tokenizers==0.19.1

ArthurZucker commented 6 months ago

Llama 3 uses a different tokenizer class. You need to update the template processor. It was never supposed to use the LlamaTokenizerFast class.

ArthurZucker commented 6 months ago

Do you need help in updating that?

JohnGiorgi commented 6 months ago

Llama 3 uses a different tokenizer class. You need to update the template processor. It was never supposed to use the LlamaTokenizerFast class.

Sorry, what's the template processor?

ArthurZucker commented 6 months ago

The Llama 3 tokenizer relies on tokenizers, which supports defining template processors. They are responsible for adding the bos or eos tokens. For Llama 3, the underlying tokenizer is PreTrainedTokenizerFast, a general class that does not really follow the logic of LlamaTokenizer or LlamaTokenizerFast. Supporting LlamaTokenizerFast would have been nice, but Llama 3 is tiktoken-based, so we wanted to separate these.

I'll try to support this natively, but in the meantime:

Anyway, the template processor of your tokenizer can be saved to make sure a bos or an eos is added. It's not done on the fly, but you can do something like this:

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
bos = tokenizer.bos_token
eos = tokenizer.eos_token
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single=f"{bos}:0 $A:0 {eos}:0",
    pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
    special_tokens=[
        (f"{bos}", tokenizer.bos_token_id), 
        (f"{eos}", tokenizer.eos_token_id)
    ],
)
tokenizer.encode("Hey")
[128000, 19182, 128001]
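
Since the template processor is part of the saved tokenizer files, the change above can be persisted and reloaded. A small sketch continuing from the snippet above; the output directory name is arbitrary:

# Persist the modified post-processor along with the rest of the tokenizer files;
# "llama3-tokenizer-with-eos" is an arbitrary local directory name.
tokenizer.save_pretrained("llama3-tokenizer-with-eos")

# Reloading from that directory keeps the bos/eos template processing.
reloaded = AutoTokenizer.from_pretrained("llama3-tokenizer-with-eos")
print(reloaded.encode("Hey"))  # expected to match the output above: [128000, 19182, 128001]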