I'm working on a PR and would like to understand the reason for the behaviour of this `_encode_bytes` function when it hits an invalid UTF-8 sequence, to ensure I don't break this functionality.

https://github.com/openai/tiktoken/blob/1b9faf2779855124f05174adf1383e53689ed94b/src/lib.rs#L474-L495

How come only the first valid UTF-8 sequence is encoded with `_encode_native` (honouring regex splits) but all subsequent bytes are encoded as a single piece with `byte_pair_encode`? The `Utf8Error` returned by `std::str::from_utf8` has an `error_len()` method which gives the length of the invalid byte sequence. So couldn't `byte_pair_encode` be used only for the invalid sequence, with `_encode_native` used again for any subsequent valid sequence? This could be implemented in a loop similar to the example loop in these Rust docs: https://doc.rust-lang.org/std/str/struct.Utf8Error.html

And more generally, I'm looking to understand the current use cases that this is supporting and the reason it's implemented the way it is. Thanks if you can share any further context.
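
To make the proposed loop concrete, here is a rough sketch of the segmentation step only, adapted from the example in the Rust docs. `Segment` and `split_utf8_segments` are hypothetical names of mine, not part of tiktoken. The idea would be that each `Valid` segment goes to `_encode_native` and each `Invalid` segment to `byte_pair_encode`, but I have not checked how this interacts with the rest of `_encode_bytes`.

```rust
use std::str;

/// A run of the input that is either valid UTF-8 or an invalid byte sequence.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
enum Segment<'a> {
    Valid(&'a str),
    Invalid(&'a [u8]),
}

/// Splits `input` into alternating valid and invalid segments.
/// (Hypothetical helper; not an existing tiktoken function.)
fn split_utf8_segments(mut input: &[u8]) -> Vec<Segment<'_>> {
    let mut segments = Vec::new();
    while !input.is_empty() {
        match str::from_utf8(input) {
            Ok(text) => {
                // Everything that remains is valid UTF-8.
                segments.push(Segment::Valid(text));
                break;
            }
            Err(err) => {
                // Bytes before `valid_up_to()` are guaranteed to be valid.
                let (valid, rest) = input.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    // Cannot fail: `from_utf8` already validated this prefix.
                    segments.push(Segment::Valid(str::from_utf8(valid).unwrap()));
                }
                // `error_len()` is `None` when the input ends in the middle of a
                // multi-byte sequence; treat the whole remainder as invalid then.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, after) = rest.split_at(bad_len);
                segments.push(Segment::Invalid(bad));
                input = after;
            }
        }
    }
    segments
}
```

With this sketch, `split_utf8_segments(b"hello\xFFworld")` yields a valid `"hello"`, an invalid `[0xFF]`, and a valid `"world"`. Adjacent invalid sequences come out as separate `Invalid` segments, so a real implementation might want to merge them before calling `byte_pair_encode`.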