huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

PreTrainedTokenizerFast `char_to_token` `token_to_char` not working as expected #1620

Open yonigottesman opened 1 month ago

yonigottesman commented 1 month ago

System Info

Who can help?

@ArthurZucker

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "the quick brown fox jumps over the lazy dog"
out = tokenizer(text)
out.char_to_token(0)  # returns None

This returns None for every character index, not just 0.

Also, token_to_chars doesn't return the expected result: out.token_to_chars(4) returns CharSpan(start=15, end=15) instead of CharSpan(start=15, end=19).

Expected behavior

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "the quick brown fox jumps over the lazy dog"
out = tokenizer(text)
out.char_to_token(0)

This should return 1.

out.token_to_chars(4) should return CharSpan(start=15, end=19)
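
The expected values above can be checked by hand. The sketch below is only an illustration: it assumes the tokenizer emits one leading special token (index 0, no character span) followed by the word pieces listed in `pieces`. That split is an assumption, not output captured from the model, and `expected_spans` is a hypothetical helper, not part of transformers or tokenizers.

```python
from typing import List, Optional, Tuple


def expected_spans(pieces: List[Optional[str]]) -> List[Optional[Tuple[int, int]]]:
    """Give each piece a (start, end) span by accumulating piece lengths.

    A piece of None stands for a special token with no span in the text.
    """
    spans: List[Optional[Tuple[int, int]]] = []
    cursor = 0
    for piece in pieces:
        if piece is None:
            spans.append(None)
            continue
        spans.append((cursor, cursor + len(piece)))
        cursor += len(piece)
    return spans


text = "the quick brown fox jumps over the lazy dog"
# Assumed token split: special token first, then space-prefixed word pieces.
pieces = [None, "the", " quick", " brown", " fox",
          " jumps", " over", " the", " lazy", " dog"]

spans = expected_spans(pieces)
print(spans[1])  # (0, 3): character 0 belongs to token 1
print(spans[4])  # (15, 19): the span expected from token_to_chars(4)
```

Under that assumed split, character 0 falls in token 1 and token 4 covers characters 15 to 19, which matches the values listed above.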

ArthurZucker commented 1 month ago

I think this is related to huggingface/transformers#25082, and it is more of a tokenizers issue than a transformers one.

ArthurZucker commented 1 month ago

I don't have a fix, but it's a bug indeed.

yonigottesman commented 1 month ago

So should I open the issue in that repo? This is really needed for huggingface/transformers#30650.

ArthurZucker commented 1 month ago

Yeah, it's basically the same as https://github.com/huggingface/tokenizers/issues/1553. Since the offsets are wrong, char_to_token, which just uses them, also returns wrong results. Let me transfer the issue!
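
To illustrate why wrong offsets break the character lookup, here is a minimal pure-Python sketch. It is not the library's actual implementation (which lives in the Rust tokenizers crate); `char_to_token_from_offsets` is a hypothetical helper, and the `collapsed` offsets are an assumed shape based on the reported CharSpan(start=15, end=15).

```python
from typing import List, Optional, Tuple


def char_to_token_from_offsets(
    offsets: List[Tuple[int, int]], char_index: int
) -> Optional[int]:
    """Return the index of the first token whose half-open span
    [start, end) contains char_index, or None if no span does."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


# Offsets as expected for the first five tokens (index 0 is a special token).
correct = [(0, 0), (0, 3), (3, 9), (9, 15), (15, 19)]
# Assumed shape of the buggy offsets: every span has end == start.
collapsed = [(0, 0), (0, 0), (3, 3), (9, 9), (15, 15)]

print(char_to_token_from_offsets(correct, 0))    # 1
print(char_to_token_from_offsets(collapsed, 0))  # None
```

With zero-width spans, no character index can ever fall inside a token, which is consistent with char_to_token returning None for every index.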

tcleberg commented 1 week ago

Any progress on this one?