huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer? #1545

Closed LuoKaiGSW closed 1 month ago

LuoKaiGSW commented 3 months ago

I have a model that uses BloomTokenizerFast, which has no attributes such as byte_decoder or sp_model, so I can't tell how it maps byte values to Unicode characters. Looking through the source code, I only found that pre_tokenize_str converts the input text into those Unicode characters; I could not find the mapping it relies on. How can I find this mapping? Or does the fast tokenizer use the same mapping as GPT-2?
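
For reference, here is a minimal sketch of the byte-to-Unicode table that GPT-2 uses. It assumes the fast tokenizer relies on the ByteLevel pre-tokenizer from the tokenizers library, which to my knowledge applies this same table; that is worth verifying against your own tokenizer. The function name bytes_to_unicode follows GPT-2's encoder code and is not an attribute of BloomTokenizerFast.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte (0-255) to one printable character.

    Assumption: a byte-level fast tokenizer uses this same table.
    """
    # Bytes that are already printable, non-space characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}

    # Every remaining byte (control characters, space, ...) is shifted to
    # code points starting at 256, in increasing byte order.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


byte_encoder = bytes_to_unicode()                      # byte value -> character
byte_decoder = {c: b for b, c in byte_encoder.items()}  # character -> byte value

# Encode text the way a byte-level pre-tokenizer would display it.
encoded = "".join(byte_encoder[b] for b in "Hello world".encode("utf-8"))
print(encoded)  # HelloĠworld

# Invert the mapping to recover the original text.
decoded = bytes(byte_decoder[c] for c in encoded).decode("utf-8")
print(decoded)  # Hello world
```

If the tokenizers package is installed, tokenizers.pre_tokenizers.ByteLevel.alphabet() should return the same 256 characters, which gives a quick way to check this assumption.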

ArthurZucker commented 3 months ago

Hey! I suppose you are using python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

LuoKaiGSW commented 3 months ago

> Hey! I suppose you are using python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

Thank you for your reply, but I didn't fully understand what you meant. Accessing tokenizer._tokenizer.model gives me a BPE object, but I don't see the attribute I'm looking for in it, namely the mapping from byte values to Unicode characters. Could you explain a bit more clearly, please?

ArthurZucker commented 3 months ago

You cannot see any attributes because neither __repr__ nor __str__ is implemented.

LuoKaiGSW commented 3 months ago

> You cannot see any attributes because both __repr__ and __str__ are not implemented

So, is it impossible to read this mapping relationship from the fast tokenizer?

ArthurZucker commented 3 months ago

It is coming with the PR that I linked 😉

ArthurZucker commented 1 month ago

Closing as we do have the capabilities merged now!