Open Jaffacakes82 opened 1 week ago
This is a good issue. Thanks for bringing it to our attention. We will think about possibly offering a built-in solution for this, but as a workaround I wonder if you could not handle this using a custom policy:
To add to this: if you'd like to examine the values of the documented rate limit response headers, you can also do that without a custom policy by retrieving the raw response from the formal response wrapper and then checking its header values:
ClientResult<ThreadRun> runResult = client.CreateRun("assistantId", "threadId");
if (runResult.GetRawResponse().Headers.TryGetValue("x-ratelimit-limit-tokens", out string remainingTokenText))
{
// remainingTokenText has a value like: "150000"
}
ResultCollection<StreamingUpdate> streamingUpdates = client.CreateRunStreaming("threadId", "assistantId");
if (streamingUpdates.GetRawResponse().Headers.TryGetValue("x-ratelimit-reset-tokens", out string resetTimeText))
{
// resetTime has a value like: "6m0s"
}
As @KrzysztofCwalina mentioned, we'll look into providing a more direct and typed mechanism to retrieve this; ideally, when the keys are well-known, you shouldn't need to provide them explicitly like this.
Thanks @KrzysztofCwalina @trrwilson. I've implemented a basic solution using exponential backoff for now.
It might make sense to make this available in the Usage
property of the ThreadRun
.
@Jaffacakes82, be aware that the clients already implement a retry logic (with exponential backoff). The retries happen on any error, and apparently they delay is not enough for your scenarios. But, because of the retry logic , it's very important that when you add your custom policy to the client, you add it at "PerTry" position. Otherwise, the built in retry logic will kick in first and the client will still be retrying too early.
@KrzysztofCwalina, I believe this is the related System.ClientModel issue: https://github.com/Azure/azure-sdk-for-net/issues/44222
Without the built-in DelayStrategy
, I think retries are -- without the custom policy -- happening immediately, irrespective of hints like retry-after
headers.
Hi,
What is the recommended approach to leveraging the Assistants API via this SDK and appropriately handling breaches of the TPM rate limits?
I'm using RAG with both GPT-3.5-turbo and GPT-4o models, and given context tokens count towards the rate limit, I'm hitting these semi-frequently. How should I handle this?
Thanks!