huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

[tokenizer] Inconsistent behavior in slow tokenizer and fast tokenizer #29159

Open Ki-Seki opened 4 months ago

Ki-Seki commented 4 months ago

System Info

Who can help?

@ArthurZucker and @younesbelkada

Information

Tasks

Reproduction

from transformers import AutoTokenizer

def answer_or_exception(tokenizer, token_id):
    # Decode a single id; print the result, or the exception message if decoding fails.
    print(f'<<<<<<{tokenizer.__class__}>>>>>>')
    try:
        print(f'"{tokenizer.decode([token_id])}"')
    except Exception as e:
        print(e)

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=False)
# largest valid id: 50294 (vocab size 50295)
answer_or_exception(tokenizer, 50294)  # decodes correctly
answer_or_exception(tokenizer, 50295)  # raises an exception

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=True)
# largest valid id: 50294 (vocab size 50295)
answer_or_exception(tokenizer, 50294)  # decodes correctly
answer_or_exception(tokenizer, 50295)  # decodes to ""

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=False)
# largest valid id: 31999 (vocab size 32000)
answer_or_exception(tokenizer, 31999)  # decodes correctly
answer_or_exception(tokenizer, 32000)  # raises an exception

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=True)
# largest valid id: 31999 (vocab size 32000)
answer_or_exception(tokenizer, 31999)  # decodes correctly
answer_or_exception(tokenizer, 32000)  # decodes to ""

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
"               "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
sequence item 0: expected str instance, NoneType found
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
"               "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
""
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
"给"
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
piece id is out of range.
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
"给"
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
""

Expected behavior

Consistent decode behavior between the slow and the fast tokenizer when an id exceeds the vocab size. For example, instead of raising an exception, the slow tokenizer could return an empty string, as the fast tokenizer does.
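
One possible direction, sketched below, is to skip ids the slow tokenizer cannot resolve, mirroring the fast tokenizer's empty-string behavior. safe_decode is a hypothetical helper used only to illustrate the idea, not the actual patch:

def safe_decode(tokenizer, ids):
    # Keep only the ids the slow tokenizer can actually resolve.
    valid_ids = []
    for token_id in ids:
        try:
            token = tokenizer.convert_ids_to_tokens(token_id)
        except IndexError:  # SentencePiece raises "piece id is out of range."
            token = None
        if token is not None:  # dict-based tokenizers return None for OOV ids
            valid_ids.append(token_id)
    return tokenizer.decode(valid_ids)

safe_decode(tokenizer, [31999, 32000])  # "给" rather than an exception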

ArthurZucker commented 4 months ago

Hey! Thanks for opening an issue. A few things first: you are using a custom / local checkpoint with trust_remote_code.

The fast tokenizer does not error out when you feed it an out-of-vocabulary (OOV) id, while the slow one does, and that is indeed inconsistent. Would you like to open a PR for a fix? 🤗
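
For anyone reproducing without the local paths above, the same check can be run against the public Hub checkpoints; a sketch assuming microsoft/phi-2 and meta-llama/Llama-2-7b-chat-hf match the local copies (the Llama repo is gated and needs access), reusing answer_or_exception from the reproduction script:

from transformers import AutoTokenizer

# Hedged variant of the repro with public Hub ids instead of local paths.
for name, oov_id in [("microsoft/phi-2", 50295),
                     ("meta-llama/Llama-2-7b-chat-hf", 32000)]:
    for use_fast in (False, True):
        tok = AutoTokenizer.from_pretrained(name, use_fast=use_fast)
        answer_or_exception(tok, oov_id)  # slow raises for the OOV id, fast returns ""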

Ki-Seki commented 4 months ago

Yes, I'll try that. Thank you for your reply!

hackpk commented 2 months ago

@ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?

Ki-Seki commented 2 months ago

> @ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?

I'm OK with that. I've been busy with other things recently. 😭

ArthurZucker commented 2 months ago

Sure 🤗