huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

[tokenizer] Inconsistent behavior in slow tokenizer and fast tokenizer #29159

Open Ki-Seki opened 4 months ago

Ki-Seki commented 4 months ago

System Info

Who can help?

@ArthurZucker and @younesbelkada

Information

Tasks

Reproduction

from transformers import AutoTokenizer

def answer_or_exception(tokenizer, token_id):
    # Decode a single id; print the result, or the exception message if decoding fails.
    print(f'<<<<<<{tokenizer.__class__}>>>>>>')
    try:
        print(f'"{tokenizer.decode([token_id])}"')
    except Exception as e:
        print(e)

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=False)
# largest valid id: 50294 (vocab size 50295)
answer_or_exception(tokenizer, 50294)  # decodes correctly
answer_or_exception(tokenizer, 50295)  # raises an exception

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=True)
# largest valid id: 50294 (vocab size 50295)
answer_or_exception(tokenizer, 50294)  # decodes correctly
answer_or_exception(tokenizer, 50295)  # decodes to ""

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=False)
# largest valid id: 31999 (vocab size 32000)
answer_or_exception(tokenizer, 31999)  # decodes correctly
answer_or_exception(tokenizer, 32000)  # raises an exception

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=True)
# largest valid id: 31999 (vocab size 32000)
answer_or_exception(tokenizer, 31999)  # decodes correctly
answer_or_exception(tokenizer, 32000)  # decodes to ""

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
"               "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
sequence item 0: expected str instance, NoneType found
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
"               "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
""
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
"给"
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
piece id is out of range.
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
"给"
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
""

Expected behavior

Consistent decode behavior between the slow and the fast tokenizer when an id exceeds the vocab size. For example, instead of raising an exception, the slow tokenizer could return an empty string, as the fast tokenizer does.
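
One possible direction, sketched below, is to skip ids the slow tokenizer cannot resolve, mirroring the fast tokenizer's empty-string behavior. safe_decode is a hypothetical helper used only to illustrate the idea, not the actual patch:

def safe_decode(tokenizer, ids):
    # Keep only the ids the slow tokenizer can actually resolve.
    valid_ids = []
    for token_id in ids:
        try:
            token = tokenizer.convert_ids_to_tokens(token_id)
        except IndexError:  # SentencePiece raises "piece id is out of range."
            token = None
        if token is not None:  # dict-based tokenizers return None for OOV ids
            valid_ids.append(token_id)
    return tokenizer.decode(valid_ids)

safe_decode(tokenizer, [31999, 32000])  # "给" rather than an exception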

ArthurZucker commented 4 months ago

Hey! Thanks for opening an issue. A few things first: you are using a custom / local checkpoint with trust_remote_code.

The fast tokenizer does not error out when you feed it an out-of-vocabulary (OOV) id, while the slow one does, and that is indeed inconsistent. Would you like to open a PR for a fix? 🤗
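
For anyone reproducing without the local paths above, the same check can be run against the public Hub checkpoints; a sketch assuming microsoft/phi-2 and meta-llama/Llama-2-7b-chat-hf match the local copies (the Llama repo is gated and needs access), reusing answer_or_exception from the reproduction script:

from transformers import AutoTokenizer

# Hedged variant of the repro with public Hub ids instead of local paths.
for name, oov_id in [("microsoft/phi-2", 50295),
                     ("meta-llama/Llama-2-7b-chat-hf", 32000)]:
    for use_fast in (False, True):
        tok = AutoTokenizer.from_pretrained(name, use_fast=use_fast)
        answer_or_exception(tok, oov_id)  # slow raises for the OOV id, fast returns ""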

Ki-Seki commented 4 months ago

Yes, I'll try that. Thank you for your reply!

hackpk commented 2 months ago

@ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?

Ki-Seki commented 2 months ago

> @ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?

I'm OK with that. I've been busy with other things recently. 😭

ArthurZucker commented 2 months ago

Sure 🤗