huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
132.05k stars, 26.3k forks

phi-3's LlamaTokenizer ignores newline character. #32136

Closed kaitolucifer closed 1 month ago

kaitolucifer commented 1 month ago

System Info

python==3.10.14 transformers==4.42.3

Who can help?

No response

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    use_fast=False,
    padding_side="right",
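    # The original passed language_model.config.max_position_embeddings, but
    # language_model is not defined in this snippet; 4096 is assumed for the 4k model.
    model_max_length=4096,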
    add_bos_token=True,
)
tokenizer.add_tokens(["<|image|>"])
input_text = "<|system|>\nYou are a helpful language and vision assistant. " +\
             "You are able to understand the visual content that the user provides, " +\
             "and assist the user with a variety of tasks using natural language.<|end|>\n" +\
             "<|user|>\n" +\
             "<|image|>\n" +\
             "What is this?<|end|>\n" +\
             "<|assistant|>\n"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))

output:

<|system|> You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. <|end|> <|user|> <|image|> 
What is this? <|end|> <|assistant|>

It seems the tokenizer ignores the newline characters that follow the special tokens.

Expected behavior

The output should be identical to input_text:

<|system|>
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|end|>
<|user|>
<|image|>
What is this?<|end|>
<|assistant|>

amyeroberts commented 1 month ago

cc @ArthurZucker @itazap

itazap commented 1 month ago

Hello! For this model, the added tokens (<|image|>, <|user|>, etc.) have an rstrip=True parameter, which strips whitespace to the right of the token. This is how Phi-3 configures them. You can override this by re-adding the tokens with rstrip=False:

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    use_fast=False,
    padding_side="right",
    add_bos_token=True,
)

# To find every token that needs updating, check tokenizer.get_added_vocab()
for token_str in ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>", "<|image|>"]:
    token = AddedToken(token_str, rstrip=False)
    tokenizer.add_tokens(token)
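
To illustrate why rstrip=True makes the newlines disappear, here is a minimal toy sketch in plain Python. It is not the transformers implementation: toy_split is a hypothetical helper written only for this illustration, and it assumes that "strips whitespace to the right of the token" means dropping leading whitespace from the text chunk that follows a matched token.

```python
import re


def toy_split(text, special_tokens):
    """Split text around special tokens (hypothetical helper, not library code).

    special_tokens maps each token string to its rstrip flag.
    """
    # Match longer tokens first so overlapping token strings resolve predictably.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"

    pieces = []
    strip_next = False
    for chunk in re.split(pattern, text):
        if chunk in special_tokens:
            pieces.append(chunk)
            # Remember whether the text right after this token must be stripped.
            strip_next = special_tokens[chunk]
        else:
            if strip_next:
                chunk = chunk.lstrip()  # this is where "\n" gets lost
            if chunk:
                pieces.append(chunk)
            strip_next = False
    return pieces


text = "<|user|>\nWhat is this?<|end|>\n"

# rstrip=True: the newline after each token is removed before tokenization.
stripped = toy_split(text, {"<|user|>": True, "<|end|>": True})

# rstrip=False: the newline survives as part of the following text chunk.
kept = toy_split(text, {"<|user|>": False, "<|end|>": False})

print(stripped)
print(kept)
print("".join(kept) == text)
```

With rstrip=False the pieces join back into the original string, which is the round-trip behavior the issue expects. With rstrip=True the newline is already gone before the text reaches the underlying model, so decoding cannot restore it.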