Using distributed or parallel set-up in script?: no
Who can help?
@ArthurZucker
Information
[ ] The official example scripts
[X] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
The word_ids() method only returns a list of zeros instead of the correct word ids.
from transformers import AutoTokenizer

sentence = "I love my cat"
tokenizer = AutoTokenizer.from_pretrained("google/Gemma-7b")  # revision a0eac5b
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded.word_ids())
# [None, 0, 0, 0, 0]
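To illustrate the impact (a small sketch reusing the snippet above): with correct word ids each token can be mapped back to its word, but with the all-zeros output every non-special token maps to the first word.

words = sentence.split()
for token, wid in zip(encoded.tokens(), encoded.word_ids()):
    # map each token back to the word its word id points at
    print(token, "->", None if wid is None else words[wid])
# With the buggy [None, 0, 0, 0, 0] every non-special token prints "-> I";
# with the expected [None, 0, 1, 2, 3] each token maps to its own word.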
I tried several of the configuration variations mentioned in the issues linked from #28881, but for Gemma none of them changes the result. The Llama 3 tokenizer outputs the correct values with the same code (a comparison sketch is below).
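For comparison, a minimal sketch against a Llama 3 tokenizer (the checkpoint name "meta-llama/Meta-Llama-3-8B" is only an example here, not necessarily the one used):

llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
llama_encoded = llama_tokenizer(sentence, return_tensors="pt")
print(llama_encoded.word_ids())
# [None, 0, 1, 2, 3]  <- one index per word, as expected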
Expected behavior
The output of word_ids should look like
[None, 0, 1, 2, 3]
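Until this is fixed, the per-word indices can still be reconstructed from character offsets; a minimal workaround sketch (assuming a fast tokenizer, so return_offsets_mapping is supported):

import re

encoded_offsets = tokenizer(sentence, return_offsets_mapping=True)
word_spans = [m.span() for m in re.finditer(r"\S+", sentence)]  # character span of each word

recovered = []
for start, end in encoded_offsets["offset_mapping"]:
    if start == end:  # special tokens such as <bos> come back with an empty (0, 0) span
        recovered.append(None)
        continue
    # assign the token to the first word whose character span overlaps the token's offsets
    recovered.append(next(i for i, (ws, we) in enumerate(word_spans) if start < we and end > ws))

print(recovered)  # expected: [None, 0, 1, 2, 3]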
System Info
transformers version: 4.41.2