run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Indexing hit rate limit error and keeps endless retrying #15649

Open zhongshuai-cao opened 2 weeks ago

zhongshuai-cao commented 2 weeks ago

Bug Description

I got a retry error while running batch embeddings with AzureOpenAI, and I wonder how to make it wait out the rate-limit window (60 s per AOAI, though the raised error code asks for roughly 30 s) and then automatically continue embedding and building the index from where it left off.

Retrying llama_index.embeddings.openai.base.aget_embeddings in 0.6633798960965387 seconds as it raised RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the Embeddings_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 32 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'

Version

0.10.67.post1

Steps to Reproduce

Indexing a very large file, or using a larger batch size, triggers this error.

Relevant Logs/Tracebacks

Retrying llama_index.embeddings.openai.base.aget_embeddings in 0.6633798960965387 seconds as it raised RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the Embeddings_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 32 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'
dosubot[bot] commented 2 weeks ago

I found some similar issues that might help you:

  1. [Question]: RateLimitError: Error code: 429 (closed) - The solution was to ensure the OpenAI API key is active and the account has sufficient quota. Enabling the automatic recharge setting on the OpenAI account can prevent this issue [1].
  2. Implement Rate Limit Handling in TogetherEmbedding() Class (closed) - Suggested implementing rate limit handling in the TogetherEmbedding() class to improve user experience [2].
  3. Getting rate limit error from Azure OpenAI when using GPTSimpleVectorIndex (closed) - Suggested reducing the embeddings creation request rate by adjusting the batch size or implementing a delay between requests [3]; a minimal sketch of such a delay follows this list.
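
For item 3, a minimal sketch of such a delay, assuming an AzureOpenAIEmbedding configured as elsewhere in this thread (the model and deployment names are placeholders):

import time

from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",  # placeholder model name
    deployment_name="my-embedding",  # placeholder deployment name
    api_key="...",
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_version="2023-07-01-preview",
    embed_batch_size=5,
)

text_batches = [["first chunk", "second chunk"], ["third chunk"]]

for batch in text_batches:
    embeddings = embed_model.get_text_embedding_batch(batch)
    time.sleep(2.0)  # crude fixed delay between requests; tune to your quota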

To handle the RateLimitError and ensure it waits for the specified time and continues from where it left off, you can use the create_retry_decorator function. Here’s an example:

import logging

import openai
from openai import AzureOpenAI
from llama_index.embeddings.openai.utils import create_retry_decorator

logger = logging.getLogger(__name__)

# openai>=1.0 Azure client; reads AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT,
# and OPENAI_API_VERSION from the environment
client = AzureOpenAI()

# Define the retry decorator with the desired parameters
retry_decorator = create_retry_decorator(
    max_retries=5,  # maximum number of retries
    random_exponential=True,  # use random exponential backoff
    stop_after_delay_seconds=60,  # stop retrying after 60 seconds
    min_seconds=4,  # minimum wait time between retries
    max_seconds=10,  # maximum wait time between retries
)

@retry_decorator
def embed_batch(batch):
    # Embed one batch of texts with the Azure OpenAI client (openai>=1.0 API)
    response = client.embeddings.create(
        input=batch,
        model="my-embedding-deployment",  # example deployment name
    )
    return response

# Example usage
batches = [...]  # Your list of batches to embed
for batch in batches:
    try:
        result = embed_batch(batch)
        # Process the result and build the index
    except openai.RateLimitError as e:
        logger.warning(f"Rate limit error: {e}")
        # Handle the rate limit error if needed

This code sets up a retry mechanism that handles RateLimitError by waiting a randomized, exponentially growing interval between retries, giving up after 5 retries or 60 seconds, whichever comes first [4].

Additionally, you can adjust the batch size when initializing the AzureOpenAIEmbedding class:

from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

embedding_llm = AzureOpenAIEmbedding(
    embed_batch_size=1,  # Adjust the batch size as needed
)

Lowering the batch size reduces how often a single request trips the rate limit; to truly continue from where you left off, you also need to track which documents have already been embedded [5].
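
A minimal sketch of that bookkeeping, assuming your documents carry stable doc_ids and using a hypothetical progress.txt checkpoint file (both names are illustrative):

import os

from llama_index.core import VectorStoreIndex

PROGRESS_FILE = "progress.txt"  # hypothetical checkpoint file

# Load the ids of documents that were already indexed on a previous run
done = set()
if os.path.exists(PROGRESS_FILE):
    with open(PROGRESS_FILE) as f:
        done = {line.strip() for line in f}

for document in documents:  # `documents` and `storage_context` as in your code
    if document.doc_id in done:
        continue  # skip work finished before the rate-limit crash
    VectorStoreIndex.from_documents([document], storage_context=storage_context)
    with open(PROGRESS_FILE, "a") as f:
        f.write(document.doc_id + "\n")  # record success so a rerun can resume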

To continue talking to Dosu, mention @dosu.

zhongshuai-cao commented 2 weeks ago

@dosu here is the code I use to build the index; please advise how to update it to use the logic you provided.

async_index_client = AsyncSearchIndexClient(
    endpoint=search_service_endpoint,
    credential=search_service_credential,
    api_version=search_service_api_version
)

vector_store = AzureAISearchVectorStore(
    search_or_index_client=async_index_client,
    index_name=index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="embedding",
    embedding_dimensionality=EMBED_SIZE,
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    language_analyzer="en.lucene",
    vector_algorithm_type="exhaustiveKnn",
)

for document in tqdm(documents, total=len(documents), desc="Processing files"):
    try:
        VectorStoreIndex.from_documents(
            [document],
            storage_context=storage_context,
            use_async=True
        )

dosubot[bot] commented 2 weeks ago

To update your code to use the retry logic for handling RateLimitError with AzureOpenAI, you can follow these steps:

  1. Import the necessary modules and set up logging and the Azure client:

    import logging

    import openai
    from openai import AzureOpenAI
    from llama_index.llms.konko.utils import _create_retry_decorator

    logger = logging.getLogger(__name__)

    # openai>=1.0 Azure client; reads AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT,
    # and OPENAI_API_VERSION from the environment
    client = AzureOpenAI()
  2. Create a retry decorator with the desired parameters:

    retry_decorator = _create_retry_decorator(max_retries=5)
  3. Apply the retry decorator to your function that makes the API call:

    @retry_decorator
    def embed_batch(batch):
        # openai>=1.0 moved embeddings onto the client object
        response = client.embeddings.create(
            input=batch,
            model="my-embedding-deployment",  # example deployment name
        )
        return response
  4. Update your code to use the retry logic:

    async_index_client = AsyncSearchIndexClient(
        endpoint=search_service_endpoint,
        credential=search_service_credential,
        api_version=search_service_api_version
    )
    
    vector_store = AzureAISearchVectorStore(
        search_or_index_client=async_index_client,
        index_name=index_name,
        index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
        id_field_key="id",
        chunk_field_key="chunk",
        embedding_field_key="embedding",
        embedding_dimensionality=EMBED_SIZE,
        metadata_string_field_key="metadata",
        doc_id_field_key="doc_id",
        language_analyzer="en.lucene",
        vector_algorithm_type="exhaustiveKnn",
    )
    
    for document in tqdm(documents, total=len(documents), desc="Processing files"):
        try:
            VectorStoreIndex.from_documents(
                [document],
                storage_context=storage_context,
                use_async=True
            )
        except openai.RateLimitError as e:
            logger.warning(f"Rate limit error: {e}")
            # Handle the rate limit error if needed

By following these steps, you will have integrated the retry logic to handle RateLimitError with AzureOpenAI in your code [1][2][3][4].
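
Note that with openai>=1.0 the SDK already retries rate-limited requests with exponential backoff on its own, so rather than a hand-rolled decorator you can often just raise the client's retry budget. A minimal sketch (the endpoint and API version are placeholders):

from openai import AzureOpenAI

# max_retries tells the openai SDK how many times to re-attempt a request
# that failed with a retryable error such as a 429
client = AzureOpenAI(
    api_key="...",
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_version="2023-07-01-preview",
    max_retries=5,
)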

To continue talking to Dosu, mention @dosu.

zhongshuai-cao commented 2 weeks ago

@dosu

I am actually using Azure OpenAI rather than openai, as in the code snippet below, and in your answer the embed_batch function is never actually used.

from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

embed_model = AzureOpenAIEmbedding(
    model=EMBED_MODEL,
    deployment_name=EMBED_DEPLOYMENT_NAME,
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=AZURE_OPENAI_API_VERSION,
    embed_batch_size=5
)
dosubot[bot] commented 2 weeks ago

To handle RateLimitError when using Azure OpenAI through the AzureOpenAIEmbedding class, you can subclass it and add a retry mechanism around the client creation methods. Here’s how you can update your code:

  1. Import necessary modules:

    import logging
    import time

    from openai import AzureOpenAI, AsyncAzureOpenAI, RateLimitError
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

    logger = logging.getLogger(__name__)
  2. Define a retry mechanism:

    class AzureOpenAIEmbeddingWithRetry(AzureOpenAIEmbedding):
        def _get_client(self) -> AzureOpenAI:
            if not self.reuse_client:
                return self._create_client_with_retries()
    
            if self._client is None:
                self._client = self._create_client_with_retries()
            return self._client
    
        def _get_aclient(self) -> AsyncAzureOpenAI:
            if not self.reuse_client:
                return self._create_aclient_with_retries()
    
            if self._aclient is None:
                self._aclient = self._create_aclient_with_retries()
            return self._aclient
    
        def _create_client_with_retries(self) -> AzureOpenAI:
            retries = 0
            while retries < self.max_retries:
                try:
                    return AzureOpenAI(**self._get_credential_kwargs())
                except RateLimitError:
                    retries += 1
                    time.sleep(2 ** retries)  # Exponential backoff
            raise RateLimitError("Max retries exceeded for creating AzureOpenAI client")
    
        def _create_aclient_with_retries(self) -> AsyncAzureOpenAI:
            retries = 0
            while retries < self.max_retries:
                try:
                    return AsyncAzureOpenAI(**self._get_credential_kwargs(is_async=True))
                except RateLimitError:
                    retries += 1
                    time.sleep(2 ** retries)  # Exponential backoff
            raise RateLimitError("Max retries exceeded for creating AsyncAzureOpenAI client")
  3. Use the updated class in your code:

    embed_model = AzureOpenAIEmbeddingWithRetry(
        model=EMBED_MODEL,
        deployment_name=EMBED_DEPLOYMENT_NAME,
        api_key=AZURE_OPENAI_API_KEY,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_version=AZURE_OPENAI_API_VERSION,
        embed_batch_size=5,
        max_retries=5  # Set the maximum number of retries
    )
    
    async_index_client = AsyncSearchIndexClient(
        endpoint=search_service_endpoint,
        credential=search_service_credential,
        api_version=search_service_api_version
    )
    
    vector_store = AzureAISearchVectorStore(
        search_or_index_client=async_index_client,
        index_name=index_name,
        index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
        id_field_key="id",
        chunk_field_key="chunk",
        embedding_field_key="embedding",
        embedding_dimensionality=EMBED_SIZE,
        metadata_string_field_key="metadata",
        doc_id_field_key="doc_id",
        language_analyzer="en.lucene",
        vector_algorithm_type="exhaustiveKnn",
    )
    
    for document in tqdm(documents, total=len(documents), desc="Processing files"):
        try:
            VectorStoreIndex.from_documents(
                [document],
                storage_context=storage_context,
                use_async=True
            )
        except RateLimitError as e:
            logger.warning(f"Rate limit error: {e}")
            # Handle the rate limit error if needed

This code wraps creation of the AzureOpenAI and AsyncAzureOpenAI clients in a retry loop with exponential backoff; together with the max_retries setting on the embedding class, it helps the embedding process ride out transient rate-limit errors [1].
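
Since the 429 message above explicitly asks you to "retry after 32 seconds", the backoff window must be able to outlast that. A sketch using tenacity (the retry library llama-index itself depends on), assuming the embed_model defined earlier; the function name and retry budget are illustrative:

from openai import RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

# Wait up to 60 s between attempts so the backoff can exceed the ~32 s window
# the 429 error asks for; give up after 8 attempts.
@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_random_exponential(min=4, max=60),
    stop=stop_after_attempt(8),
)
def embed_texts(texts):
    return embed_model.get_text_embedding_batch(texts)  # embed_model as above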

To continue talking to Dosu, mention @dosu.