Closed: efsotr closed this issue 1 month ago
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(t("今天天气好", add_special_tokens=False)[0].offsets)
[(0, 2), (2, 3), (3, 4), (4, 5)]
When encoding Chinese characters, its output is correct.
Hey @efsotr 🤗 It appears that the issue is resolved on later transformers
versions, please let me know if it persists!
transformers==4.44.2
@itazap On my server, I run this script:
from transformers import __version__, AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(__version__)
print(t("are you ok?", add_special_tokens=False)[0].offsets)
and the output is
4.44.2
[(0, 0), (3, 3), (7, 7), (10, 10)]
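For reference, each offset pair should be a (start, end) span into the original string, so that text[start:end] recovers the token's surface text; the degenerate (n, n) pairs above are the bug. A minimal sketch of the expected behavior, using a locally built WordLevel tokenizer with a hypothetical toy vocab (no model download needed):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit

# toy vocab for illustration only; not the llama3 vocab
vocab = {"are": 0, "you": 1, "ok?": 2, "[UNK]": 3}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = WhitespaceSplit()  # split on whitespace only

text = "are you ok?"
enc = tok.encode(text)
print(enc.offsets)  # [(0, 3), (4, 7), (8, 11)]
# each (start, end) pair slices the original string back out:
print([text[s:e] for s, e in enc.offsets])  # ['are', 'you', 'ok?']
```

The broken output above ((0, 0), (3, 3), ...) collapses each span to a single point, so the offsets can no longer be used to map tokens back to the input.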
Hmm, can you please share your tokenizers version as well?
import transformers
import tokenizers
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(transformers.__version__, tokenizers.__version__)
print(t("are you ok?", add_special_tokens=False)[0].offsets)
4.45.0 0.20.0
[(0, 0), (3, 3), (7, 7), (10, 10)]
Thank you! @ArthurZucker looks like a regression in tokenizers==0.20: I'm able to reproduce on several versions of transformers (incl. 4.45), and it only works with tokenizers==0.19.
It works with tokenizers==0.19.0. It doesn't with tokenizers==0.19.1.
Mmm, that's weird: https://github.com/huggingface/tokenizers/releases/tag/v0.19.1 only added the ignore_merges
serialization, which is set to True
for llama3 (try setting it to False!).
We'll see if we can do a patch.
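For anyone who wants to poke at ignore_merges in isolation: a toy BPE built locally (hypothetical two-symbol vocab, not the llama3 tokenizer) shows what the flag does. With ignore_merges=True, a word that already exists as a whole entry in the vocab is emitted directly instead of going through the merge loop:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# toy vocab/merges; "ab" exists both as a merge result and as a vocab entry
vocab = {"a": 0, "b": 1, "ab": 2}
merges = [("a", "b")]

for flag in (True, False):
    tok = Tokenizer(BPE(vocab=vocab, merges=merges, ignore_merges=flag))
    enc = tok.encode("ab")
    # both settings should yield the same token "ab"; ignore_merges=True
    # just short-circuits the merge procedure to produce it
    print(flag, enc.tokens, enc.offsets)
```

On a transformers fast tokenizer, the flag lives on the backend model, so something like t.backend_tokenizer.model.ignore_merges = False should disable the shortcut, assuming the attribute is exposed as writable in your tokenizers version.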
System Info
transformers version: 4.41.2

Who can help?
@ArthurZucker and @itazap