huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Llama3 Instruct Tokenizers.Encoding.offsets is wrong #33675

Closed efsotr closed 1 month ago

efsotr commented 1 month ago

System Info

Who can help?

@ArthurZucker and @itazap

Information

Tasks

Reproduction

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(t("are you ok?", add_special_tokens=False)[0].offsets)

output:

[(0, 0), (3, 3), (7, 7), (10, 10)]

Expected behavior

[(0, 3), (3, 7), (7, 10), (10, 11)]
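The expected spans can be checked directly in plain Python: with correct offsets, slicing the input text by each (start, end) pair reproduces one token's surface text, while the zero-width pairs in the actual output recover nothing. (The token pieces below are inferred from the reported offsets, not re-tokenized.)

```python
text = "are you ok?"

# Expected offsets: each (start, end) pair slices out one token's text.
expected = [(0, 3), (3, 7), (7, 10), (10, 11)]
print([text[s:e] for s, e in expected])   # ['are', ' you', ' ok', '?']

# Buggy offsets from the report: every span is zero-width,
# so slicing recovers only empty strings.
buggy = [(0, 0), (3, 3), (7, 7), (10, 10)]
print([text[s:e] for s, e in buggy])      # ['', '', '', '']
```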
efsotr commented 1 month ago
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(t("今天天气好", add_special_tokens=False)[0].offsets)
[(0, 2), (2, 3), (3, 4), (4, 5)]

When it encodes Chinese characters, its output is correct.

itazap commented 1 month ago

Hey @efsotr 🤗 It appears that the issue is resolved on later transformers versions, please let me know if it persists!

transformers==4.44.2:

(Screenshot from 2024-09-26 showing the offsets output.)
efsotr commented 1 month ago

@itazap On my server, running this script:

from transformers import __version__, AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(__version__)
print(t("are you ok?", add_special_tokens=False)[0].offsets)

and the output is

4.44.2
[(0, 0), (3, 3), (7, 7), (10, 10)]
itazap commented 1 month ago

hmm can you please share your tokenizers version as well?

efsotr commented 1 month ago
import transformers
import tokenizers
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(transformers.__version__, tokenizers.__version__)
print(t("are you ok?", add_special_tokens=False)[0].offsets)
4.45.0 0.20.0
[(0, 0), (3, 3), (7, 7), (10, 10)]
itazap commented 1 month ago

Thank you! @ArthurZucker looks like a regression in tokenizers==0.20; I'm able to reproduce on several versions of transformers (incl. 4.45), and it only works with tokenizers==0.19?

efsotr commented 1 month ago

It works with tokenizers==0.19.0. It doesn't with tokenizers==0.19.1.

ArthurZucker commented 1 month ago

Mmm, that's weird: https://github.com/huggingface/tokenizers/releases/tag/v0.19.1 only added the ignore_merges serialization, which is set to True for llama3 (try setting it to False!)
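One way to try that suggestion without rebuilding the tokenizer is to flip the `ignore_merges` field in the `"model"` section of the saved `tokenizer.json` (the field whose serialization the v0.19.1 release added). The snippet below is a minimal sketch using a stand-in config, not the full Llama 3 file, which also contains the vocab and merges:

```python
import json

# Minimal stand-in for the "model" section of a Llama-3-style tokenizer.json;
# the real file also carries the full vocab and merge list.
config = {"model": {"type": "BPE", "ignore_merges": True, "vocab": {}, "merges": []}}

# Flip the flag off, as suggested above, to test whether it is what
# changes the offsets.
config["model"]["ignore_merges"] = False

# In practice you would write this back to tokenizer.json and reload
# the tokenizer from the patched file.
print(json.dumps(config["model"]["ignore_merges"]))  # false
```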

ArthurZucker commented 1 month ago

we'll see if we can do a patch