kaitolucifer closed this issue 1 month ago
cc @ArthurZucker @itazap
Hello! For this model, the added tokens (<|image|>, <|user|>, etc.) have an rstrip=True parameter, which strips whitespace to the right of the token. This is how they are configured in Phi-3. You can override this by re-adding the tokens with rstrip=False:
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    use_fast=False,
    padding_side="right",
    add_bos_token=True,
)

# For all the tokens you need to update, you can check in tokenizer.get_added_vocab()
for token_str in ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>", "<|image|>"]:
    token = AddedToken(token_str, rstrip=False)
    tokenizer.add_tokens(token)
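To make the effect of rstrip concrete, here is a toy illustration in plain Python. It is not the transformers implementation: split_on_special is a hypothetical helper that only mimics the idea that a special token with rstrip=True swallows the whitespace immediately to its right.

```python
import re


def split_on_special(text, special, rstrip):
    """Toy model of an added token's rstrip flag (NOT the transformers code).

    Splits `text` around occurrences of `special`. When `rstrip` is True,
    whitespace directly after the special token is consumed and lost.
    """
    pattern = re.escape(special) + (r"\s*" if rstrip else "")
    pieces = []
    pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.append(special)
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


text = "<|user|>\nHello"
stripped = split_on_special(text, "<|user|>", rstrip=True)
kept = split_on_special(text, "<|user|>", rstrip=False)

print(stripped)  # ['<|user|>', 'Hello']   -> the newline is gone
print(kept)      # ['<|user|>', '\nHello'] -> the newline survives
```

In this toy model, joining the pieces back together reproduces the input only in the rstrip=False case, which matches the behavior described above.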
System Info
python==3.10.14 transformers==4.42.3
Who can help?
No response
Reproduction
output:
It seems like the tokenizer ignores the newline character.
Expected behavior
The output should be identical to input_text.
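One way to state this expectation as a check is the minimal sketch below. The helper name roundtrip_matches and the CharTokenizer stand-in are hypothetical; the stand-in exists only so the snippet runs without downloading a model. With a real tokenizer you would pass the object returned by AutoTokenizer.from_pretrained instead, and whether an exact match is achievable may also depend on other tokenizer settings.

```python
class CharTokenizer:
    """Hypothetical stand-in tokenizer: one id per character, lossless."""

    def encode(self, text, add_special_tokens=False):
        return [ord(ch) for ch in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


def roundtrip_matches(tokenizer, input_text):
    """Return True if decoding the encoded ids gives back input_text exactly."""
    ids = tokenizer.encode(input_text, add_special_tokens=False)
    return tokenizer.decode(ids) == input_text


result = roundtrip_matches(CharTokenizer(), "<|user|>\nHello")
print(result)
```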