Sounds like a good idea. I'll give it a go. Thanks
This will be reserved for the next major version.
Note that in the OpenAI API case, you will only truly know how many tokens a request consumed after the request has succeeded.
Therefore, you can only guess how much capacity you will need beforehand, but there should be a way of notifying the bucket of how much capacity was actually consumed after each request.
Good point @dekked. Typically you can get a decent bound ahead of time, since you can calculate the context side of the token count and set a maximum for the completion/generation side when you send the request.
Actually, I think what would work better for that case is to use the rough bound only for the first request, and then use the exact token counts that OpenAI returned for the previous request. If you do use `max_tokens`, even better, as you can pretty much count everything beforehand.
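As a rough sketch of what that estimate could look like (using tiktoken to count the prompt side; the per-message overhead here is approximate and model-dependent):

```python
import tiktoken

def estimate_request_tokens(messages, max_tokens, model="gpt-3.5-turbo"):
    """Upper-bound estimate: countable prompt tokens + the max_tokens cap."""
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    prompt_tokens += 4 * len(messages)  # rough per-message formatting overhead
    return prompt_tokens + max_tokens
```

The `usage` block on the response then gives the exact count to correct with afterwards.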
For what it's worth, OpenAI has an example of a script for bulk analysis that handles errors and rate limits: https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py
My understanding of their rate limit implementation is that the `num_tokens_consumed_from_request` function uses the API call's `max_tokens` value as the consumption amount for a request, and the token capacity counter is decremented by that amount.
That should be an upper bound and so avoid rate limit errors, but an approach like @dekked suggests (using the exact # tokens consumed by the request) would be better at maximizing the available throughput.
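In case it helps, the shape of that counter logic is roughly the following (a simplified sketch, not the cookbook script's actual code):

```python
import time

class TokenCapacityTracker:
    """Simplified sketch: capacity is debited by a request's estimated
    consumption and refills linearly up to the per-minute limit."""

    def __init__(self, tokens_per_minute):
        self.max_capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.last_update = time.monotonic()

    def try_consume(self, estimated_tokens):
        # refill proportionally to elapsed time, capped at the limit
        now = time.monotonic()
        elapsed = now - self.last_update
        self.available = min(self.max_capacity,
                             self.available + self.max_capacity * elapsed / 60)
        self.last_update = now
        if self.available >= estimated_tokens:
            self.available -= estimated_tokens
            return True
        return False
```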
In terms of API design for pyratelimiter, being able to apply "negative weights" to the buckets would support the use case when the estimate is too high. But to be fully general, it might be nice to be able to forcibly add usage to the buckets (like, try_acquire except it always succeeds). That way if your estimate was too low, you can let the limiter know without blocking.
pseudocode:
```python
def make_request(req):
    num_tokens_estimate = calc_num_tokens(req)
    # block until the limiter has capacity for the estimated weight
    while True:
        try:
            limiter.try_acquire(num_tokens_estimate)
            break
        except BucketFullException:
            time.sleep(1)
    result = api_request.Create(req)
    # positive if the initial estimate was too low, negative if it was too high
    usage_diff = result.tokens_used - num_tokens_estimate
    # proposed API: forcibly add usage (possibly negative) to the buckets
    limiter.force_add_usage(usage_diff)
    return result
```
Resolved in the new major release (v3.0.0)
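For anyone finding this later, a weighted acquire with v3 looks roughly like this (a minimal sketch; the `Rate`/`Limiter`/`weight` names are based on my reading of the v3 API, so check the README for exact signatures). Two separate limiters are used because the per-request limit and the per-token limit need different weights per call:

```python
from pyrate_limiter import Duration, Limiter, Rate

# the two OpenAI limits discussed in this thread
request_limiter = Limiter(Rate(3_000, Duration.MINUTE))    # requests per minute
token_limiter = Limiter(Rate(250_000, Duration.MINUTE))    # tokens per minute

def acquire(estimated_tokens):
    request_limiter.try_acquire("openai-requests")                       # weight 1
    token_limiter.try_acquire("openai-tokens", weight=estimated_tokens)  # weighted
```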
Cool library. I was thinking it would be useful if, instead of a 1 call = 1 request calculation for the rate limit, there could be an optional "weight". This would probably only be applicable to the `try_acquire` method. The use case is calling OpenAI: they have both a QPS rate limit and a "token" limit. Tokens are basically the number of words in the text. So their rate limit is something like 3k queries per minute and 250k tokens (words) per minute. I want to use this library to handle both of those. If I could give each item in the bucket a "weight" (the number of tokens in the text, in this example), then I think the rest of the library should work as is. Right now I think I can hack around it, but it would be a cool feature.