Closed: osainz59 closed this issue 1 year ago.
I also ran into this problem with LlamaTokenizer. What should I do to make it work? self.tokenizer.eos_token shows nothing.
transformers version: 4.33.1
pip install -U transformers
🤗
The latest version of transformers fails here, but I got it running with transformers==4.36.0
The Llama 2 fast tokenizer runs into the same issue here; I haven't tried the slow version.
@jonathangomesselman In order to control the -100 label masks, do you think it would work to modify DataCollatorForLanguageModeling so that the call that creates the labels only runs when labels are not already provided (as opposed to using DataCollatorForSeq2Seq)? For example:
from typing import Any, Dict, List, Union

from transformers import DataCollatorForLanguageModeling


class CustomDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    def torch_call(
        self, examples: List[Union[List[int], Any, Dict[str, Any]]]
    ) -> Dict[str, Any]:
        # Original Hugging Face code that builds `batch` and `special_tokens_mask`
        # ....
        if self.mlm:
            batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
                batch["input_ids"], special_tokens_mask=special_tokens_mask
            )
        elif "labels" not in batch.keys():  # <-- Only line changed
            labels = batch["input_ids"].clone()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            batch["labels"] = labels
        return batch
I've had the same problem with GPT-2; that tokenizer doesn't insert an eos token at the end of a sample (by design), in contrast to Gemma.
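A minimal sketch for checking this yourself, assuming the gpt2 and google/gemma-2b checkpoints (the checkpoint ids are just examples, and the Gemma one is gated, so it may require authentication):

from transformers import AutoTokenizer

# Compare whether each tokenizer appends eos_token_id on its own.
for name in ["gpt2", "google/gemma-2b"]:  # checkpoint ids are just examples
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("a quick test")["input_ids"]
    print(name, "ends with eos:", ids[-1] == tok.eos_token_id)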
Thanks to @jonathangomesselman, I put together tokenization code with batch processing, coupled it with the Trainer, and wrote an inference function for a decoder-only transformer (trimming the input prompt from the output tensors).
# Dataset structure
DatasetDict({
    train: Dataset({
        features: ['src', 'tgt', 'text'],
        num_rows: 1000
    })
})
# Tokenize function and data collator
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq, Trainer


def tokenize(samples):
    # model_inputs = tokenizer(samples["text"], max_length=256, padding="max_length", truncation=True)
    model_inputs = tokenizer(samples["text"])
    model_responses = tokenizer(samples["tgt"])
    new_input_ids = []
    new_labels = []
    for input_ids, response_ids in zip(model_inputs.input_ids, model_responses.input_ids):
        # debug_labels = input_ids[-len(response_ids) :]
        # print(tokenizer.decode(input_ids, skip_special_tokens=False))
        # print(tokenizer.decode(debug_labels, skip_special_tokens=False))
        input_ids.append(tokenizer.eos_token_id)
        new_input_ids.append(input_ids)
        # Supervise only the final len(response_ids) tokens; mask the rest with -100
        num_tokens_ignore = len(input_ids) - len(response_ids)
        labels = [-100] * num_tokens_ignore + input_ids[-len(response_ids) :]
        new_labels.append(labels)
    new_attention_mask = []
    for attention_mask in model_inputs.attention_mask:
        attention_mask.append(1)  # account for the appended eos token
        new_attention_mask.append(attention_mask)
    model_inputs["input_ids"] = new_input_ids
    model_inputs["attention_mask"] = new_attention_mask
    model_inputs["labels"] = new_labels
    return model_inputs
processed_dataset = text_dataset.map(tokenize, batched=True, remove_columns=["tgt", "src", "text"])

tokenizer.pad_token = tokenizer.eos_token
# tokenizer.pad_token = ""  # alternative from the original snippet; use one or the other
data_collator = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True)

dataloader = DataLoader(processed_dataset["train"], batch_size=4, collate_fn=data_collator)
for batch in dataloader:
    print(batch)
# To use with the Trainer
trainer = Trainer(
    model=model,
    train_dataset=processed_dataset["train"],
    args=args,
    data_collator=data_collator,
)
# Inference example
input_texts = [
    "Instruction: <text1>. Response:",
    "Instruction: <text2>. Response:",
    "Instruction: <text3>. Response:",
]
inputs = tokenizer(input_texts, return_tensors="pt", padding=True).to(device)
trained_model.config.use_cache = False
outputs = trained_model.generate(
    **inputs,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
    max_length=256,
)
# Trim the prompt tokens so only the generated continuation is decoded
trimmed_output = outputs[:, inputs.input_ids.shape[1] :]
trimmed_outputs = tokenizer.batch_decode(trimmed_output, skip_special_tokens=True)
print(trimmed_outputs)
Adjust for your dataset, where "text" is the "full" sentence and "tgt" is the label (the part of that sentence you want the model to predict).
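One caveat for the batched generation above, sketched as an addition rather than part of the original recipe: decoder-only models generally expect left padding at inference time, otherwise pad tokens sit between each prompt and its generated continuation. Something like:

# Decoder-only models should be padded on the left for batched generation,
# so that every prompt ends right where generation starts.
tokenizer.padding_side = "left"
inputs = tokenizer(input_texts, return_tensors="pt", padding=True).to(device)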
Adding the eos_token at the end of each training example can be activated using:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token=True)

Or simply:

>>> tokenizer.add_eos_token = True
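A quick sanity check, as a sketch (the exact result depends on your transformers version, given the bug discussed in this thread):

>>> tokenizer("this is a test")["input_ids"][-1] == tokenizer.eos_token_id  # should be True once add_eos_token takes effect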
This is still not working for me using Llama-3-8B:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    add_eos_token=True,
    force_download=True,
    token=True,
)
tokenizer.tokenize("this is a test", add_special_tokens=True)
>>> ['<|begin_of_text|>', 'this', 'Ġis', 'Ġa', 'Ġtest']
I am on transformers==4.40.1 and tokenizers==0.19.1.
Llama 3 uses a different tokenizer class. You need to update the template processor. It was never supposed to use the LlamaTokenizerFast class. Do you need help updating that?
Llama 3 uses a different tokenizer class. You need to update the template processor. It was never supposed to use the LlamaTokenizerFast class.
Sorry, what's the template processor?
The Llama 3 tokenizer relies on tokenizers, which supports defining template_processors. These are responsible for adding the bos or eos. For Llama 3, the underlying tokenizer is PreTrainedTokenizerFast, a general class that does not really follow the logic of LlamaTokenizer or LlamaTokenizerFast. Supporting LlamaTokenizerFast would have been nice, but Llama 3 is tiktoken-based, so we wanted to separate these. I'll try to support it natively, but in the meantime, the template processor of your tokenizer can be set to make sure a bos or an eos is added. It's not done on the fly, but you can do something like this:
from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
bos = tokenizer.bos_token
eos = tokenizer.eos_token
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single=f"{bos}:0 $A:0 {eos}:0",
    pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
    special_tokens=[
        (f"{bos}", tokenizer.bos_token_id),
        (f"{eos}", tokenizer.eos_token_id),
    ],
)
tokenizer.encode("Hey")
[128000, 19182, 128001]
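A possible follow-up, sketched under the assumption that you want to keep this change: verify the processor took effect and save the tokenizer so the post_processor is serialized into tokenizer.json (the directory name is just an example):

# Verify: the last id should now be the eos (128001 for Meta-Llama-3-8B)
print(tokenizer.encode("Hey")[-1] == tokenizer.eos_token_id)
# Persist the modified post_processor along with the rest of the tokenizer files
tokenizer.save_pretrained("llama3-tokenizer-with-eos")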
System Info

transformers version: 4.29.0.dev0

Who can help?

@ArthurZucker

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

As mentioned in the title, the LLaMA tokenizer does not add the eos_token at the end of the inputs. This only happens on the fast version (use_fast=True). Steps to reproduce the behaviour: tokenize any input and inspect the input_ids to check whether the eos_token_id (2) is added at the end.

Expected behavior

Expected output: the input_ids should end with the eos_token_id (2).
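A minimal sketch of the check described above, assuming the huggyllama/llama-7b checkpoint and add_eos_token=True (these details are assumptions, not the issue's original script):

from transformers import AutoTokenizer

# Assumed setup: a LLaMA v1 checkpoint, fast tokenizer, eos enabled.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token=True, use_fast=True)
input_ids = tokenizer("this is a test")["input_ids"]
# Expected: the last id equals tokenizer.eos_token_id (2); the reported bug is that it does not.
print(input_ids[-1], tokenizer.eos_token_id)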