ahmedmoorsy closed this 10 months ago
In general, we do not guarantee that every Unicode codepoint is a single token (because there are like a million of them), nor do we guarantee that individual tokens are valid UTF-8. It's a "byte" pair encoding, after all.
[encoding.decode_single_token_bytes(token) for token in encoded_message]
will show you the bytes of your tokens.
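To make this concrete, here is a rough sketch (not an official tiktoken recipe) that lists the raw bytes of each token and then merges consecutive byte chunks until they form valid UTF-8, which gives a list of printable pieces. The encoding name `cl100k_base` and the helper names `merge_into_text_pieces` and `token_pieces` are assumptions made for illustration; only `tiktoken.get_encoding`, `encoding.encode`, and `encoding.decode_single_token_bytes` are tiktoken calls.

```python
def merge_into_text_pieces(token_bytes):
    """Hypothetical helper: join consecutive byte chunks until they decode as UTF-8.

    A symbol such as "∩" takes three bytes in UTF-8, so BPE may split it
    across several tokens; a chunk that ends mid-character cannot be
    decoded on its own and is held back until the next chunk arrives.
    """
    pieces = []
    buffer = b""
    for chunk in token_bytes:
        buffer += chunk
        try:
            pieces.append(buffer.decode("utf-8"))
            buffer = b""
        except UnicodeDecodeError:
            # Incomplete character so far: keep accumulating bytes.
            continue
    if buffer:
        # Leftover bytes that never became valid UTF-8 are shown with U+FFFD.
        pieces.append(buffer.decode("utf-8", errors="replace"))
    return pieces


def token_pieces(text, encoding_name="cl100k_base"):
    """Return (bytes of each token, merged text pieces) for `text`.

    The default encoding name is an assumption; use whichever encoding
    matches your model.
    """
    import tiktoken  # imported here so the helper above works without it

    encoding = tiktoken.get_encoding(encoding_name)
    token_ids = encoding.encode(text)
    token_bytes = [encoding.decode_single_token_bytes(t) for t in token_ids]
    return token_bytes, merge_into_text_pieces(token_bytes)


if __name__ == "__main__":
    raw, pieces = token_pieces("A⊇B")
    print(raw)     # one bytes object per token
    print(pieces)  # readable pieces; one piece may span several tokens
```

Note that the merged list can be shorter than the token list, since a piece may cover more than one token. If you need exactly one string per token, decoding each chunk with `errors="replace"` is an alternative, at the cost of replacement characters for partial symbols.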
Please see https://github.com/openai/tiktoken#what-is-bpe-anyway and the tiktoken._educational submodule if you have more questions about BPE.
Hello,
I am trying to use tiktoken to tokenize some texts that contain math symbols like ∩, ⊆, and A⊇B, but tiktoken fails to decode each of these to a single token.
Code:
I tried the encoding.decode() method and it works very well, but it gives me the full text, and I need a list of decoded tokens instead.
Any help?