roldengarm opened 7 months ago
Thanks for including the logs. As you can see the error message includes the reason:
> Azure OpenAI API version 2023-12-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.
We can change the exception thrown in that scenario, but the underlying problem persists: the number of requests exceeds the available quota.
Thanks for your prompt response. Yeah, returning a proper response code would be useful.
So, the main issue is that we're ingesting thousands of docs, continuously hammering Azure AI. Is it possible to throttle on the application side, so that some quota remains for asking questions? I don't mind if ingestion takes a bit longer; the key requirement is that it responds quickly when asking questions.
E.g. it would be nice if we can configure KM to only ingest x at a time. We're using Azure Queues for ingestion. @dluc
When using the service with the underlying queues, the ingestion should continue, retrying. I haven't tested this specific scenario though. If the ingestion stops then it's a bug.
On the "ask" side though, your client will have to retry. We'll change the error code to 429 or similar, so you don't have to retry on 500.
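The retry-on-429 pattern described here can be sketched generically. This is an illustrative Python helper, not a Kernel Memory API; `retry_with_backoff` and the shape of the callable are assumptions for the sketch:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=1.0,
                       retryable=frozenset({429, 500, 503})):
    """Invoke call() until it returns a non-retryable status, sleeping
    base_delay * 2**attempt (plus a little jitter) between attempts."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in retryable:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return status, body  # give up, surface the last response
```

In practice, if the server sends a `Retry-After` header (as the Azure error text suggests), honoring it is better than blind exponential backoff.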
> E.g. it would be nice if we can configure KM to only ingest x at a time.
Agreed, though we'd need someone to work on it because of other priorities. Consider also that the service can be deployed on multiple machines, so "x at a time" would have to be orchestrated centrally.
For now you could slow down the ingestion, waiting for a document to be complete before sending the next.
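That "wait for a document to be complete before sending the next" approach can be sketched as a simple poll loop. `upload` and `is_ready` are hypothetical stand-ins for whatever the client exposes (an import call plus a status check):

```python
import time

def ingest_sequentially(docs, upload, is_ready, poll_interval=1.0):
    """Upload one document at a time; poll is_ready(doc_id) until the
    pipeline reports completion before uploading the next document."""
    for doc in docs:
        doc_id = upload(doc)
        while not is_ready(doc_id):
            time.sleep(poll_interval)
```

This trades throughput for predictable load: at most one document's worth of embedding requests is in flight at any moment.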
Also, this can be related to https://github.com/microsoft/semantic-kernel/issues/4744. I'm trying to test the retry logic.
Update: @alkampfergit is working on embedding batch generation, which will reduce the number of requests, see for example PR https://github.com/microsoft/kernel-memory/pull/526 and https://github.com/microsoft/kernel-memory/pull/531
I have a PR open and I'm adding a setting for the maximum number of elements in a batch, since Azure OpenAI with ada has a hard limit of 16 elements per batch.
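Independent of that PR, the 16-element constraint just means splitting the input client-side before calling the embedding endpoint; a minimal sketch:

```python
def batch(items, max_batch_size=16):
    """Split items into consecutive batches of at most max_batch_size,
    e.g. to respect Azure OpenAI ada's 16-element embedding batch limit."""
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]
```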
> Update: @alkampfergit is working on embedding batch generation, which will reduce the number of requests, see for example PR #526 and #531
That is awesome news @dluc. We're ingesting very large datasets with KM and, based on previous calculations, about 9 million records would take ~22 days; the main bottleneck seems to be the ADA embedding generation.
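For context, a back-of-the-envelope on those numbers (rough arithmetic only, assuming one embedding request per record before batching):

```python
records = 9_000_000
days = 22

# Observed pace: roughly 4.7 records per second end to end.
per_second = records / (days * 24 * 3600)

requests_unbatched = records          # one embedding request per record
requests_batched = -(-records // 16)  # ceil division: batches of 16
```

Batching by 16 cuts the request count by a factor of 16 (9,000,000 requests down to 562,500), though each request carries more tokens, so token-per-minute quotas still apply.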
I've checked both PRs and I can see one is merged, the other one pending. I guess we should wait until @alkampfergit's change that configures the max batch size of 16 has been merged?
We're using `ImportTextAsync` to import text documents, and run 12 parallel threads. It waits until each document is ready. Do we have to change anything to benefit from these batching changes?
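The "12 parallel, wait until each document is ready" setup is a bounded-concurrency pattern. Here is an illustrative asyncio sketch, where `import_text` is a hypothetical stand-in for the real import call:

```python
import asyncio

async def ingest_limited(docs, import_text, max_parallel=12):
    """Run import_text(doc) for every doc, but keep at most
    max_parallel imports in flight at any one time."""
    sem = asyncio.Semaphore(max_parallel)

    async def one(doc):
        async with sem:           # blocks while max_parallel are running
            return await import_text(doc)

    return await asyncio.gather(*(one(d) for d in docs))
```

Tuning `max_parallel` down is the simplest way to leave headroom in the rate limit for "ask" traffic.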
Sorry for all the questions, but the throughput has been one of our major challenges with Kernel Memory.
Hi @roldengarm, actually I suggest using the new embedding models; as far as I know they do not suffer from the 16-element limit. I have a custom ingestion pipeline where I constantly pass blocks of 50 elements without any problem (I did not check whether the underlying client enforces a limit for me).
The OpenAI API (not Azure) has a limit of 2048. :/
Thanks!
Which one exactly do you mean? See models here. Do you mean `text-embedding-3-large`?
Large or small depends on how fast your vector store is, but remember that those models support an additional parameter to reduce the number of dimensions.
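The parameter referred to here is the `dimensions` field accepted by the `text-embedding-3-*` models in the OpenAI embeddings API. A small sketch of the request body shape (the helper name is made up, and whether your client library exposes the field directly may vary):

```python
def embedding_request(texts, model="text-embedding-3-large", dimensions=1024):
    """Build the JSON body for an OpenAI-style embeddings call.
    The optional `dimensions` field (text-embedding-3-* only) asks the
    model for shorter vectors, which lightens the vector-store load."""
    body = {"model": model, "input": texts}
    if dimensions is not None:
        body["dimensions"] = dimensions
    return body
```

Fewer dimensions means smaller indexes and faster similarity search, at some cost in retrieval quality.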
We're using Postgres as our data store; previously we used Azure AI Search. In both cases, the embedding generation seems to be the bottleneck. When I didn't limit the number of documents being ingested in parallel, KM would continuously throw errors about token limits being exceeded in Azure AI. Now we're limiting ingestion to 12 documents at a time, which solved that issue.
However, I'd like to increase the speed if possible, either by changing the embeddings model, or by using the new batching feature, or...? I'm still unsure how.
We're using the standard configuration in regards to dimensions.
> errors about token limits
Might not be relevant, but consider also that there's a max size for each chunk of text, and each model can handle a different max amount. So there are always three factors to consider:
> are the text chunks too big for the embedding model?
In our case no; they are ingested fine. It's just that too many ingestions happening in parallel causes issues.
> is the client sending too many requests per second to the LLM endpoint?
We use Kernel Memory as a web service, and it's configured to use Azure Queues. We only ingest 12 in parallel and wait until it's finished.
> with batching: is the client sending too many chunks per batch?
I'm still not sure how I can enable batching in Kernel Memory when running as a service, or does that happen automatically? @dluc
The feature is not ready yet; once ready it will be configurable. Details are still TBD because each service has different limits to consider.
Can we do this as a configuration option?
Batch embedding generation ready and released. Thanks @alkampfergit https://github.com/microsoft/kernel-memory/pull/531 !
Quick notes:
Work left before closing:
E.g. when calling the KM service, if the AI backend internally returns 429, the KM web service should return 429 too (including some useful info)
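The intended behavior can be sketched as a tiny status translation (illustrative only; the actual KM implementation was still open work at this point):

```python
def client_status(upstream_status):
    """Map the status from the AI backend to the status the service
    returns to its caller: propagate throttling (429) explicitly
    instead of collapsing everything into an opaque 500."""
    if upstream_status == 429:
        return 429  # caller can back off and retry
    return upstream_status if upstream_status < 500 else 500
```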
Context / Scenario
I've deployed Kernel Memory as a service to an Azure App Service and I'm ingesting a large amount of data. These ingestion operations are continuously being retried because we've reached our Azure AI token limit. When I try to ask a question using the Kernel Memory Web Client, I randomly get an HTTP 500 error that does not carry any additional information. As 99% of our App Insights logs are being sampled, I've only seen the underlying error logged once; it was caused by the Azure AI token limit.
What happened?
Using the Kernel Memory Web Client, I try to ask a question:
Response:
One or more errors occurred. (Response status code does not indicate success: 500 (Internal Server Error).)
When I keep retrying, at some point it works.
Expectation:
Importance
I cannot use Kernel Memory
Platform, Language, Versions
Kernel Memory service: Updated the service 2 days ago (23 January 2024) Kernel Memory Webclient: 0.26.240121.1
Relevant log output