openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Understanding the intended behaviour of `_encode_bytes` #288

Open ashleyholman opened 5 months ago

ashleyholman commented 5 months ago

I'm working on a PR and would like to understand the reason for the behaviour of the `_encode_bytes` function when it hits an invalid UTF-8 sequence, to ensure I don't break this functionality.

https://github.com/openai/tiktoken/blob/1b9faf2779855124f05174adf1383e53689ed94b/src/lib.rs#L474-L495

How come only the first valid UTF-8 sequence is encoded with `_encode_native` (honouring regex splits), while all subsequent bytes are encoded as a single piece with `byte_pair_encode`? The `Utf8Error` returned by `std::str::from_utf8` has an `error_len()` method which gives the length of the invalid byte sequence. So couldn't `byte_pair_encode` be used only for the invalid sequence, with `_encode_native` used again for any subsequent valid sequence? This could be implemented in a loop similar to the example loop in the Rust docs: https://doc.rust-lang.org/std/str/struct.Utf8Error.html
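To make the proposal concrete, here is a minimal sketch of the splitting loop, modelled on the example in the `Utf8Error` docs. It is not tiktoken's code: `Chunk` and `split_utf8_chunks` are hypothetical names I made up for illustration, and the sketch only separates the input into valid and invalid runs. The idea would be to pass each `Valid` chunk to `_encode_native` and each `Invalid` chunk to `byte_pair_encode`; whether that is correct when a token should span such a boundary is exactly what I'm unsure about.

```rust
/// Hypothetical helper type: a run of input bytes, classified by UTF-8 validity.
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    /// A maximal run of valid UTF-8.
    Valid(&'a str),
    /// One invalid (or truncated) byte sequence, as reported by `Utf8Error`.
    Invalid(&'a [u8]),
}

/// Hypothetical helper: split `input` into alternating valid and invalid chunks.
fn split_utf8_chunks(mut input: &[u8]) -> Vec<Chunk<'_>> {
    let mut chunks = Vec::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(text) => {
                // Everything that remains is valid UTF-8.
                chunks.push(Chunk::Valid(text));
                break;
            }
            Err(err) => {
                // Bytes before `valid_up_to()` are guaranteed to be valid UTF-8.
                let (valid, rest) = input.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    let text = std::str::from_utf8(valid)
                        .expect("prefix up to valid_up_to() is valid UTF-8");
                    chunks.push(Chunk::Valid(text));
                }
                // `error_len()` is `None` when the input ends in the middle of a
                // multi-byte sequence; in that case treat the whole tail as invalid.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                chunks.push(Chunk::Invalid(bad));
                input = tail;
            }
        }
    }
    chunks
}
```

For example, calling `split_utf8_chunks(b"hello\xFFworld")` yields a valid chunk `"hello"`, an invalid chunk containing the single byte `0xFF`, and a valid chunk `"world"`. Consecutive invalid sequences come out as separate `Invalid` chunks, so a caller that wants one BPE piece per invalid run would need to merge them.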

And more generally I'm looking to understand the current use cases that this is supporting and the reason it's implemented like it is. Thanks if you can share any further context.