Open vvolhejn opened 1 year ago
It's a little bit of both. The first step we do in tokenising is regex splitting, where we split on things like whitespace. The ensuing chunks are then fed to BPE. When tokenising natural language, these chunks usually end up very small.
BPE is superlinear in chunk length, so what you're seeing is approximately expected. But the implementation in tiktoken is optimised for short chunk lengths where constant factors dominate, so time complexity as chunk length scales is worse than what it could be. Thus far, this hasn't mattered in practice, so I haven't bothered branching the logic and adding a bunch of complexity for "aaaaaaaaaaaa...". Do you have a use case, or are you just curious? :-)
Thanks! The use case is that we're building an API that accepts text produced/consumed by GPT and we limit the length of the input to a certain number of tokens. Checking if the text is within the limit requires tokenizing it, and if tokenizing certain adversarial long strings takes a long time, this would be a way to DoS the server.
Isn't this also an issue that OpenAI itself has to deal with?
Thanks, that's a reasonable use case. Every token is at least one byte, so for now I'd recommend running a quick length check against your input first. At typical current maximum context lengths, this should provide a reasonable guarantee.
I work on the research side of things, but OpenAI is probably pretty happy for you to pay us for these tokens since CPU time is basically free. We have other mitigations against denial of service as well.
Every token is at least one byte, so for now I'd recommend running a quick length check against your input first.
But I guess the maximum token length is quite long, which is what I'd need to make statements like "for inputs longer than X bytes, they definitely have more than Y tokens so I can reject them"
The way we worked around this this in the meanwhile is that for extremely long prompts we require a certain number of spaces + other mitigations against DoS as you mention.
If you want a more precise (but not perfectly exact) way to check, you can do something like this, which parses progressively larger portions of the input. This is much more precise than bounding the length of the input, and in practice tends to be reasonably fast (tiktoken parsing of small strings is insanely fast, so if your input is only a few thousand code points then the code below should take low single-digit milliseconds).
@dataclass
class TokenCheck:
verification_passed: bool
num_tokens: int
def get_number_tokens(message: str, limit: int, processing_size=500) -> TokenCheck:
for end_index in range(0, len(message), processing_size):
if len(encoding.encode(message[:end_index], disallowed_special=())) > limit:
return TokenCheck(verification_passed=False, num_tokens=limit * 2)
num_tokens = len(encoding.encode(message, disallowed_special=()))
if num_tokens > limit:
return TokenCheck(verification_passed=False, num_tokens=limit * 2)
else:
return TokenCheck(verification_passed=True, num_tokens=num_tokens)
We're using this check for a game that requires reasonably exact token counts because pushing the limits of the LLM is part of the game.
For what it's worth, I just ran into a bug with tiktoken taking crazy amount of time which broke some code I was working on trying to do data analysis. Here's a plot I generated of runtime across token length:
The ultimate problem was I was trying to encode a dataset of output generated by GPT-J with tiktoken, and the model just fell into one of its loops where it just wrote out a bunch of whitespace over and over. And this caused nearly 100,000 whitespace tokens to be tokenized back-to-back which just stalled my tiktoken encoding forever.
I think it would probably be a good idea to implement a fix for this. While tiktoken was probably designed for encoding GPT-4 output, it's not the only thing people use it for. And having superlinear runtimes can cause other people a lot of pain in ways they wouldn't expect.
@qxcv Thanks! Looks like a good approximate workaround. Btw somebody sent Tensor Trust to me a few weeks ago and I really like the idea, although I haven't had time to play around with it yet. (We made https://gandalf.lakera.ai/, so we're working in the same space 🙂)
This can be solved by a different rank merge algorithm - I've fixed it in https://github.com/knuddelsgmbh/jtokkit (new version not released yet), might also contribute it back to tiktoken if it's an important usecase.
The blue is the current array based algo (doing linear minimum search) and the red is the new AVL tree based:
I pushed a PR here as well to tackle this exact scenario, see: https://github.com/openai/tiktoken/pull/239
I've noticed that Tiktoken is really slow for strings of repeated characters like
"a" * 100_000
. Interestingly, when you add spaces, like"a " * 50_000
, the performance is orders of magnitude better:Is this a bug or a fundamental property of BPE?
My Tiktoken version is 0.5.0.