Phi-3's LlamaTokenizer ignores newline characters

kaitolucifer commented 1 month ago

System Info

python==3.10.14 transformers==4.42.3

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    use_fast=False,
    padding_side="right",
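    # The original script passed language_model.config.max_position_embeddings here;
    # language_model is not defined in this snippet, so 4096 (assumed for the 4k model) stands in.
    model_max_length=4096,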
    add_bos_token=True,
)
tokenizer.add_tokens(["<|image|>"])
input_text = (
    "<|system|>\nYou are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language.<|end|>\n"
    "<|user|>\n"
    "<|image|>\n"
    "What is this?<|end|>\n"
    "<|assistant|>\n"
)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))

output:

<|system|> You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. <|end|> <|user|> <|image|> 
What is this? <|end|> <|assistant|>

It seems the tokenizer drops the newline characters that follow the special tokens.

Expected behavior

The output should be identical to input_text:

<|system|>
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|end|>
<|user|>
<|image|>
What is this?<|end|>
<|assistant|>

amyeroberts commented 1 month ago

cc @ArthurZucker @itazap

itazap commented 1 month ago

Hello! For this model, the added tokens (<|image|>, <|user|>, etc.) have an rstrip=True parameter, which strips whitespace to the right of the token. This is how the tokens are configured in Phi-3. You can override it by re-adding the tokens with rstrip=False:

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    use_fast=False,
    padding_side="right",
    add_bos_token=True,
)

# To find every token that needs updating, inspect tokenizer.get_added_vocab()
for token_str in ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>", "<|image|>"]:
    token = AddedToken(token_str, rstrip=False)
    tokenizer.add_tokens(token)
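
To make the effect of rstrip concrete, here is a toy simulation in plain Python. It is only a sketch of the behaviour described above, not the transformers implementation: strip_after_tokens and the flag dictionary are hypothetical names made up for this illustration, and the flags assume that Phi-3's own tokens ship with rstrip=True while a token added as a plain string (like <|image|> in the reproduction) gets rstrip=False.

```python
import re


def strip_after_tokens(text: str, added_tokens: dict[str, bool]) -> str:
    """Toy model of rstrip: for every added token whose flag is True,
    drop the whitespace that immediately follows it in the text."""
    for token, rstrip in added_tokens.items():
        if rstrip:
            text = re.sub(re.escape(token) + r"\s+", token, text)
    return text


prompt = "<|user|>\n<|image|>\nWhat is this?<|end|>\n<|assistant|>\n"

# Assumed flags: built-in Phi-3 tokens strip to the right, the user-added one does not.
phi3_like = {
    "<|user|>": True,
    "<|end|>": True,
    "<|assistant|>": True,
    "<|image|>": False,
}

# With rstrip=True, the newline after each built-in token is swallowed.
stripped = strip_after_tokens(prompt, phi3_like)

# With rstrip=False everywhere (the suggested fix), the prompt is left untouched.
preserved = strip_after_tokens(prompt, {token: False for token in phi3_like})

print(repr(stripped))
print(repr(preserved))
```

In this toy model, the only newline that survives in stripped is the one after <|image|>, which is consistent with the output reported in the issue.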
