roldengarm opened 7 months ago
Thanks for including the logs. As you can see the error message includes the reason:
> Azure OpenAI API version 2023-12-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.
We can change the exception thrown in that scenario, but the underlying problem persists: the number of requests exceeds the available quota.
Thanks for your prompt response. Yeah, returning a proper response code would be useful.
So, the main issue is that we're ingesting thousands of docs, continuously hammering Azure AI. Is it possible to throttle on the application side, so that some quota remains for asking questions? I don't mind if ingestion takes a bit longer; the key requirement is that it responds quickly when asking questions.
E.g. it would be nice if we can configure KM to only ingest x at a time. We're using Azure Queues for ingestion. @dluc
When using the service with the underlying queues, the ingestion should continue, retrying. I haven't tested this specific scenario though. If the ingestion stops then it's a bug.
On the "ask" side though, your client will have to retry. We'll change the error code to 429 or similar, so you don't have to retry on 500.
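The retry-on-429 pattern described here can be sketched generically. This is an illustrative Python helper, not a Kernel Memory API; `retry_with_backoff` and the shape of the callable are assumptions for the sketch:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=1.0,
                       retryable=frozenset({429, 500, 503})):
    """Invoke call() until it returns a non-retryable status, sleeping
    base_delay * 2**attempt (plus a little jitter) between attempts."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in retryable:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return status, body  # give up, surface the last response
```

In practice, if the server sends a `Retry-After` header (as the Azure error text suggests), honoring it is better than blind exponential backoff.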
> E.g. it would be nice if we can configure KM to only ingest x at a time.
Agreed, though we'd need someone to work on it because of other priorities. Consider also that the service can be deployed on multiple machines, so "x at a time" would have to be orchestrated centrally.
For now you could slow down the ingestion, waiting for a document to be complete before sending the next.
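That "wait for a document to be complete before sending the next" approach can be sketched as a simple poll loop. `upload` and `is_ready` are hypothetical stand-ins for whatever the client exposes (an import call plus a status check):

```python
import time

def ingest_sequentially(docs, upload, is_ready, poll_interval=1.0):
    """Upload one document at a time; poll is_ready(doc_id) until the
    pipeline reports completion before uploading the next document."""
    for doc in docs:
        doc_id = upload(doc)
        while not is_ready(doc_id):
            time.sleep(poll_interval)
```

This trades throughput for predictable load: at most one document's worth of embedding requests is in flight at any moment.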
Also, this can be related to https://github.com/microsoft/semantic-kernel/issues/4744. I'm trying to test the retry logic.
Update: @alkampfergit is working on embedding batch generation, which will reduce the number of requests, see for example PR https://github.com/microsoft/kernel-memory/pull/526 and https://github.com/microsoft/kernel-memory/pull/531
I have a PR open and I'm adding a setting for the maximum number of elements in a batch, since Azure OpenAI with ada has a hard limit of 16 elements per batch.
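Independent of that PR, the 16-element constraint just means splitting the input client-side before calling the embedding endpoint; a minimal sketch:

```python
def batch(items, max_batch_size=16):
    """Split items into consecutive batches of at most max_batch_size,
    e.g. to respect Azure OpenAI ada's 16-element embedding batch limit."""
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]
```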
> Update: @alkampfergit is working on embedding batch generation, which will reduce the number of requests, see for example PR #526 and #531
That is awesome news @dluc. We're ingesting very large datasets with KM and, based on previous calculations, about 9 million records would take ~22 days; the main bottleneck seems to be the ADA embedding generation.
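For context, a back-of-the-envelope on those numbers (rough arithmetic only, assuming one embedding request per record before batching):

```python
records = 9_000_000
days = 22

# Observed pace: roughly 4.7 records per second end to end.
per_second = records / (days * 24 * 3600)

requests_unbatched = records          # one embedding request per record
requests_batched = -(-records // 16)  # ceil division: batches of 16
```

Batching by 16 cuts the request count by a factor of 16 (9,000,000 requests down to 562,500), though each request carries more tokens, so token-per-minute quotas still apply.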
I've checked both PRs and I can see one is merged, the other one pending. I guess we should wait until @alkampfergit's change that configures the max batch size of 16 has been merged?
We're using `ImportTextAsync` to import text documents, and run 12 parallel threads. It waits until each document is ready. Do we have to change anything to benefit from these batching changes?
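The "12 parallel, wait until each document is ready" setup is a bounded-concurrency pattern. Here is an illustrative asyncio sketch, where `import_text` is a hypothetical stand-in for the real import call:

```python
import asyncio

async def ingest_limited(docs, import_text, max_parallel=12):
    """Run import_text(doc) for every doc, but keep at most
    max_parallel imports in flight at any one time."""
    sem = asyncio.Semaphore(max_parallel)

    async def one(doc):
        async with sem:           # blocks while max_parallel are running
            return await import_text(doc)

    return await asyncio.gather(*(one(d) for d in docs))
```

Tuning `max_parallel` down is the simplest way to leave headroom in the rate limit for "ask" traffic.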
Sorry for all the questions, but the throughput has been one of our major challenges with Kernel Memory.
Hi @roldengarm, actually I suggest using the new embedding models; as far as I know they do not suffer from the 16-element limit. I have a custom ingestion pipeline where I constantly pass blocks of 50 elements without any problem (I did not check whether the underlying client enforces a limit for me).
The OpenAI API (not Azure) has a limit of 2048. :/
Thanks!
Which one exactly do you mean? See models here. Do you mean `text-embedding-3-large`?
Large or small depends on how fast your vector store is, but remember that those models support an additional parameter to reduce the number of dimensions.
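The parameter referred to here is the `dimensions` field accepted by the `text-embedding-3-*` models in the OpenAI embeddings API. A small sketch of the request body shape (the helper name is made up, and whether your client library exposes the field directly may vary):

```python
def embedding_request(texts, model="text-embedding-3-large", dimensions=1024):
    """Build the JSON body for an OpenAI-style embeddings call.
    The optional `dimensions` field (text-embedding-3-* only) asks the
    model for shorter vectors, which lightens the vector-store load."""
    body = {"model": model, "input": texts}
    if dimensions is not None:
        body["dimensions"] = dimensions
    return body
```

Fewer dimensions means smaller indexes and faster similarity search, at some cost in retrieval quality.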
We're using Postgres as our data store; previously we used Azure AI Search. In both cases, the embedding generation seems to be the bottleneck. When I didn't limit the number of documents being ingested in parallel, KM would continuously throw errors about token limits being exceeded in Azure AI. Now we're limiting ingestion to 12 documents at a time, which solved that issue.
However, I'd like to increase the speed if possible, either by changing the embeddings model, or by using the new batching feature, or...? I'm still unsure how.
We're using the standard configuration in regards to dimensions.
> errors about token limits
Might not be relevant, but consider also that there's a max size for each chunk of text, and each model can handle a different max amount. So there are always three factors to consider:
> are the text chunks too big for the embedding model?
In our case no; they are ingested fine. It's just that too many ingestions happening in parallel causes issues.
> is the client sending too many requests per second to the LLM endpoint?
We use Kernel Memory as a web service, and it's configured to use Azure Queues. We only ingest 12 in parallel and wait until it's finished.
> with batching: is the client sending too many chunks per batch?
I'm still not sure how I can enable batching in Kernel Memory when running as a service, or does that happen automatically? @dluc
The feature is not ready yet; once ready it will be configurable. Details are still TBD because each service has different limits to consider.
Can we do this as a configuration option?
Batch embedding generation ready and released. Thanks @alkampfergit https://github.com/microsoft/kernel-memory/pull/531 !
Quick notes:
Work left before closing:
E.g. when calling the KM service, if the AI backend internally returns 429, the KM web service should return 429 too (including some useful info)
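The intended behavior can be sketched as a tiny status translation (illustrative only; the actual KM implementation was still open work at this point):

```python
def client_status(upstream_status):
    """Map the status from the AI backend to the status the service
    returns to its caller: propagate throttling (429) explicitly
    instead of collapsing everything into an opaque 500."""
    if upstream_status == 429:
        return 429  # caller can back off and retry
    return upstream_status if upstream_status < 500 else 500
```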
Context / Scenario
I've deployed Kernel Memory as a service to an Azure App Service and I'm ingesting a large amount of data. These ingestion operations are continuously being retried because we've reached our Azure AI token limit. When I try to ask a question using the Kernel Memory Web Client, I randomly get an HTTP 500 error that does not carry any additional information. As 99% of our App Insights logs are being sampled, I've only seen the underlying error logged once; it was caused by the Azure AI token limit.
What happened?
Using the Kernel Memory Web Client, I try to ask a question:
Response:
One or more errors occurred. (Response status code does not indicate success: 500 (Internal Server Error).)
When I keep retrying, at some point it works.
Expectation:
Importance
I cannot use Kernel Memory
Platform, Language, Versions
Kernel Memory service: Updated the service 2 days ago (23 January 2024) Kernel Memory Webclient: 0.26.240121.1
Relevant log output