Closed LuoKaiGSW closed 1 month ago
Hey! I suppose you are using `python` and can't see what's inside your tokenizer! #1542 should help you with this 🤗
Thank you for your reply, but I didn't fully understand what you meant. After calling `tokenizer._tokenizer.model`, I got a `BPE` object, but I couldn't see the attribute I wanted on it, namely the mapping from byte values to Unicode characters. Could you explain it a bit more clearly, please?
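For reference, the byte-to-Unicode table used by GPT-2-style byte-level BPE can be reconstructed in plain Python rather than read off the `BPE` object. This is a sketch of the well-known `bytes_to_unicode` construction from the GPT-2 code, offered here as an assumption about what the fast tokenizer uses internally:

```python
def bytes_to_unicode():
    # Bytes that are already printable, non-space characters keep their own
    # code point; every other byte is shifted up past 255 so it becomes a
    # distinct, visible character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()          # byte value -> character
byte_decoder = {c: b for b, c in byte_encoder.items()}  # character -> byte value
print(byte_encoder[32])  # the space byte maps to 'Ġ'
```

Inverting the dict, as above, gives you the `byte_decoder`-style mapping that the slow tokenizers expose as an attribute.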
You cannot see any attributes because both `__repr__` and `__str__` are not implemented.
So, is it impossible to read this mapping relationship from the fast tokenizer?
It is coming with the PR that I linked 😉
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Closing as we do have the capabilities merged now!
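For anyone landing here later: with the `tokenizers` library installed, the byte-level alphabet (the full set of characters the byte-to-Unicode mapping can produce) can be listed directly from the `ByteLevel` pre-tokenizer. A minimal sketch, assuming a reasonably recent `tokenizers` version:

```python
from tokenizers import pre_tokenizers

# ByteLevel.alphabet() returns the 256 characters that byte-level
# pre-tokenization can emit, one per possible byte value.
alphabet = pre_tokenizers.ByteLevel.alphabet()
print(len(alphabet))    # 256
print("Ġ" in alphabet)  # the character that stands in for a space byte
```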
I have a model that uses `BloomTokenizerFast`, which has no attributes like `byte_decoder` or `sp_model`, so I can't figure out how it implements the mapping between byte values and Unicode characters. Looking through the source code, I only found that `pre_tokenize_str` converts the input text into Unicode characters, but I couldn't see the mapping it relies on. How can I find this mapping? Or is the mapping used by the fast tokenizer the same as gpt2's?
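As a sketch of what `pre_tokenize_str` does here, assuming Bloom's fast tokenizer uses the standard `ByteLevel` pre-tokenizer component (which this thread suggests but does not explicitly confirm):

```python
from tokenizers import pre_tokenizers

# A standalone ByteLevel pre-tokenizer, configured not to prepend a space.
pt = pre_tokenizers.ByteLevel(add_prefix_space=False)

# Each byte of the UTF-8 encoding is replaced by its mapped printable
# character: 'é' encodes as bytes 0xC3 0xA9, which show up as 'Ã©'.
# This is the same behaviour as GPT-2's byte-to-unicode table.
pieces = pt.pre_tokenize_str("héllo")
print(pieces[0][0])  # 'hÃ©llo'
```

So the visible characters are not the mapping itself but its output; the underlying table is the GPT-2-style one, which you can reconstruct as shown earlier in the thread.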