openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Any introduction to the `encode_with_unstable` API? #137

Open fseasy opened 1 year ago

fseasy commented 1 year ago

https://github.com/openai/tiktoken/blob/095924e02c85617df6889698d94515f91666c7ea/src/lib.rs#L524 Hello, I was reading the lib.rs code and found the `encode_with_unstable` API. It doesn't seem to be mentioned in the documentation, yet it takes up a lot of space in lib.rs, and the comments in the code don't explain why it exists or what it does. Could some extra explanation be added?

hauntsaninja commented 1 year ago

This is a great question. I have some nice internal documentation explaining what problem this solves; I'll see if I can make a version of it that doesn't include internal-only details.

ashleyholman commented 4 months ago

Any update on this? I'm working on a PR for this repo and need to make sure I don't break `encode_with_unstable`. I think I get the main point: if you split text arbitrarily, not necessarily aligned with the regex splits, the tokens at the split boundaries might end up different than if the whole string were tokenized at once. But it would help to have more backstory on the motivation for this and the use cases it serves.
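
For what it's worth, here is a minimal sketch of the boundary issue using the Python bindings (assuming the `cl100k_base` encoding; the exact split point and the tokens you get may vary by encoding). The last two lines assume `encode_with_unstable` is exposed on the Python `Encoding` object under that name, mirroring the Rust function discussed here, and returning the stable prefix tokens plus possible completions of the unstable tail.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "indivisible"
whole = enc.encode(text)

# Split at an arbitrary character position, not aligned with the regex pre-tokenisation splits.
left, right = text[:5], text[5:]
pieces = enc.encode(left) + enc.encode(right)

print(whole)   # tokens for the full string
print(pieces)  # may differ around the boundary, even though the decoded text is identical
assert enc.decode(whole) == enc.decode(pieces) == text

# Assumption: the Python method mirrors the Rust `encode_with_unstable` and returns
# (stable_tokens, completions) — the tokens that cannot change no matter what text
# follows, plus the candidate token sequences for the unstable tail.
stable, completions = enc.encode_with_unstable(left)
print(stable, completions)
```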