PreTrainedTokenizerFast `char_to_token` and `token_to_chars` not working as expected

yonigottesman commented 1 month ago

System Info

transformers version: 4.44.0
Platform: macOS-13.6.9-arm64-arm-64bit
Python version: 3.11.4
Huggingface_hub version: 0.23.4
Safetensors version: 0.4.3
Accelerate version: 0.32.1
Accelerate config: not found
PyTorch version (GPU?): 2.4.0 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "the quick brown fox jumps over the lazy dog"
out = tokenizer(text)  # BatchEncoding from the fast tokenizer
out.char_to_token(0)

This returns None for every character index, not just 0.

Also, token_to_chars doesn't return the expected result: out.token_to_chars(4) returns CharSpan(start=15, end=15) instead of CharSpan(start=15, end=19).

Expected behavior

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "the quick brown fox jumps over the lazy dog"
out = tokenizer(text)
out.char_to_token(0)

should return 1

out.token_to_chars(4) should return CharSpan(start=15, end=19)
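To make the expected values concrete without loading the model, here is a minimal sketch that rebuilds character spans from token pieces in plain Python. The `pieces` list is an assumption about how this sentence is split (a leading special token followed by one piece per word, with the preceding space attached); `spans_from_pieces` and `char_to_token` are hypothetical helpers, not part of transformers or tokenizers. The approach only holds when the pieces concatenate back to the original text exactly.

```python
from typing import List, Optional, Tuple

Span = Optional[Tuple[int, int]]


def spans_from_pieces(text: str, pieces: List[Optional[str]]) -> List[Span]:
    """Walk the text left to right and assign each piece its (start, end) span.

    A piece of None stands for a special token that covers no characters.
    """
    spans: List[Span] = []
    cursor = 0
    for piece in pieces:
        if piece is None:
            spans.append(None)
            continue
        if not text.startswith(piece, cursor):
            raise ValueError(f"piece {piece!r} does not match text at {cursor}")
        spans.append((cursor, cursor + len(piece)))
        cursor += len(piece)
    return spans


def char_to_token(spans: List[Span], char_index: int) -> Optional[int]:
    """Return the index of the first token whose span contains char_index."""
    for token_index, span in enumerate(spans):
        if span is not None and span[0] <= char_index < span[1]:
            return token_index
    return None


text = "the quick brown fox jumps over the lazy dog"
# Assumed split: index 0 is the special start token, then one piece per word.
pieces = [None, "the", " quick", " brown", " fox", " jumps",
          " over", " the", " lazy", " dog"]

spans = spans_from_pieces(text, pieces)
print(char_to_token(spans, 0))  # token index covering character 0
print(spans[4])                 # span of token 4
```

Under that assumed split, character 0 maps to token 1 and token 4 covers characters 15 to 19, which matches the expected behavior described above.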

ArthurZucker commented 1 month ago

I think this is related to huggingface/transformers#25082 and is more related to tokenizers than transformers

ArthurZucker commented 1 month ago

I don't have a fix, but it's a bug indeed

yonigottesman commented 1 month ago

So should I open the issue in that repo? This is really needed for huggingface/transformers#30650

ArthurZucker commented 1 month ago

Yeah, it's basically the same as https://github.com/huggingface/tokenizers/issues/1553. Since the offsets are wrong, char_to_token, which just uses them, also returns wrong results. Let me transfer the issue!
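A rough illustration of why wrong offsets break the lookup, as a simplified model rather than the actual tokenizers implementation: if the character-to-token lookup scans the offsets for a span containing the index, then zero-width spans can never match. `lookup_token` is a hypothetical stand-in, and the `collapsed` offsets are modeled on the reported `CharSpan(start=15, end=15)`, assuming every token's end collapses onto its start.

```python
from typing import List, Optional, Tuple

Offsets = List[Tuple[int, int]]


def lookup_token(offsets: Offsets, char_index: int) -> Optional[int]:
    """Return the first token whose half-open span [start, end) holds char_index."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


# Offsets one would expect for "the quick brown fox" with a leading special token.
healthy: Offsets = [(0, 0), (0, 3), (3, 9), (9, 15), (15, 19)]

# Assumed shape of the buggy offsets: every span has end == start.
collapsed: Offsets = [(start, start) for start, _ in healthy]

healthy_hits = [lookup_token(healthy, i) for i in range(19)]
collapsed_hits = [lookup_token(collapsed, i) for i in range(19)]

print(healthy_hits)
print(collapsed_hits)  # no index falls inside a zero-width span
```

With the collapsed offsets, the lookup finds no token for any character index, which is consistent with `char_to_token` returning None everywhere.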

tcleberg commented 1 week ago

Any progress on this one?
