huggingface / transformers

đŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GemmaTokenizerFast word_ids() returns only zeros #31437

Open Alienmaster opened 1 week ago

Alienmaster commented 1 week ago

System Info


Who can help?

@ArthurZucker

Information

Tasks

Reproduction

The method word_ids() returns only a list of zeros instead of the correct word IDs.

from transformers import AutoTokenizer

sentence = "I love my cat"
tokenizer = AutoTokenizer.from_pretrained("google/Gemma-7b")  # revision a0eac5b
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded.word_ids())
# [None, 0, 0, 0, 0]

I tried several of the configuration variations suggested in the issues linked from #28881, but for Gemma none of them changes the result. The Llama 3 tokenizer returns the correct values with this same code.
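Until this is fixed, one possible workaround is to derive the word IDs from the character offsets instead of word_ids(). This is only a sketch: it assumes the fast tokenizer's offset mapping (via return_offsets_mapping=True) is still correct even though word_ids() collapses to zeros, and it treats whitespace-separated spans as words. The function name and the example offsets below are illustrative, not taken from an actual Gemma run.

```python
def word_ids_from_offsets(sentence, offsets):
    """Map each token's (start, end) character span to a word index.

    Mirrors BatchEncoding.word_ids(): special tokens (reported with an
    empty span) map to None; real tokens map to the index of the
    whitespace-separated word they start in.
    """
    # Precompute the word index for every non-space character position.
    char_to_word = {}
    word_idx = -1
    in_word = False
    for pos, ch in enumerate(sentence):
        if ch.isspace():
            in_word = False
        else:
            if not in_word:
                word_idx += 1
                in_word = True
            char_to_word[pos] = word_idx

    ids = []
    for start, end in offsets:
        if start == end:  # special tokens like <bos> report an empty span
            ids.append(None)
            continue
        # SentencePiece-style tokens often include the preceding space,
        # so skip leading whitespace inside the span.
        pos = start
        while pos < end and sentence[pos].isspace():
            pos += 1
        ids.append(char_to_word.get(pos))
    return ids


# Illustrative offsets for "I love my cat" with a leading special token;
# real offsets come from tokenizer(sentence, return_offsets_mapping=True).
sentence = "I love my cat"
offsets = [(0, 0), (0, 1), (1, 6), (6, 9), (9, 13)]
print(word_ids_from_offsets(sentence, offsets))
# [None, 0, 1, 2, 3]
```

This only recovers whitespace-level word boundaries, so it will not match word_ids() for pre-tokenized input or punctuation-splitting pre-tokenizers, but it may be enough for simple alignment use cases.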

Expected behavior

The output of word_ids() should be [None, 0, 1, 2, 3].

ArthurZucker commented 1 week ago

Hey! Will have a look, thanks for reporting!