huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AutoTokenizer: Phi-3 drops spaces when decoding one token at a time #31643

Open Andrei-Aksionov opened 3 weeks ago

Andrei-Aksionov commented 3 weeks ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

from transformers import AutoTokenizer

phi_2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

for name, tokenizer in (("phi-2", phi_2_tokenizer), ("phi-3", phi_3_tokenizer)):
    print(f"Tokenizer: {name}")
    tokens = tokenizer.encode("This is a test string")
    print(f"{tokens=}")
    print(tokenizer.decode(tokens))
    print("".join([tokenizer.decode(token) for token in tokens]))
    print("-" * 50)
Tokenizer: phi-2
tokens=[1212, 318, 257, 1332, 4731]
This is a test string
This is a test string
--------------------------------------------------
Tokenizer: phi-3
tokens=[1, 910, 338, 263, 1243, 1347]
<s> This is a test string
<s>Thisisateststring
--------------------------------------------------
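A common workaround in streaming scenarios is to decode a growing prefix of ids and emit only the text delta, instead of decoding each id in isolation, so per-call whitespace handling cannot drop spaces. Below is a minimal sketch of that technique; `toy_decode` and `VOCAB` are hypothetical stand-ins that mimic the space-stripping behavior shown above, not the real Phi-3 tokenizer:

```python
def stream_decode(decode_fn, token_ids):
    """Yield text deltas by decoding a growing prefix and diffing against
    the previous decode, so leading-space handling stays consistent."""
    previous = ""
    for i in range(1, len(token_ids) + 1):
        current = decode_fn(token_ids[:i])
        yield current[len(previous):]
        previous = current


# Toy decoder mimicking a tokenizer that strips a leading space
# unless the sequence starts with the BOS token (id 1).
VOCAB = {1: "<s>", 910: " This", 338: " is", 263: " a", 1243: " test", 1347: " string"}

def toy_decode(ids):
    text = "".join(VOCAB[i] for i in ids)
    if ids and ids[0] != 1 and text.startswith(" "):
        text = text[1:]  # the space-dropping behavior from the report
    return text


ids = [1, 910, 338, 263, 1243, 1347]
print("".join(toy_decode([i]) for i in ids))       # spaces lost, as in the issue
print("".join(stream_decode(toy_decode, ids)))     # spaces preserved
```

With the real tokenizer, `decode_fn` would be `tokenizer.decode`; the prefix-diff makes the workaround independent of how any particular tokenizer treats a lone token.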

Expected behavior

I expect that, even when decoding a single token at a time, the resulting string should contain spaces between tokens. As shown above, the Phi-2 tokenizer has no such problem, but for some reason Phi-3 produces a concatenated string with the spaces dropped.

ArthurZucker commented 3 weeks ago

cc @itazap

itazap commented 2 weeks ago

Hey @Andrei-Aksionov , thanks for the reproducer! It has to do with Phi-3 being based on LlamaTokenizerFast and Phi-2 on CodeGen. LlamaTokenizerFast strips the leading whitespace so that it can manually add a prefix space when add_prefix_space is set. I'm looking into a fix now that handles this better!
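To illustrate why that stripping loses spaces piece-by-piece: SentencePiece-style tokenizers mark word boundaries with a "▁" metaspace, which decoding turns into a space; if the decoder then removes the leading space it injected via add_prefix_space, a token decoded on its own loses its boundary space. A toy sketch (not the actual LlamaTokenizerFast decoder, just the general idea under that assumption):

```python
def metaspace_decode(pieces, add_prefix_space=True):
    """Sketch of SentencePiece metaspace decoding: '▁' marks a word
    boundary and becomes a space; the injected prefix space is stripped."""
    text = "".join(pieces).replace("▁", " ")
    if add_prefix_space and text.startswith(" "):
        text = text[1:]  # undo the manually added prefix space
    return text


pieces = ["▁This", "▁is", "▁a", "▁test", "▁string"]
print(metaspace_decode(pieces))                          # full decode keeps spaces
print("".join(metaspace_decode([p]) for p in pieces))    # per-piece decode drops them
```

Decoding the whole sequence strips only the one genuine prefix space, but decoding piece-by-piece strips the boundary space of every piece, which is exactly the concatenated output in the report.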