gmdl007 opened this issue 4 days ago
To reduce the rate of summary_index queries and avoid hitting the rate limit, you can implement a rate limiting decorator. Here is an example of how you can do this in Python:
import time
import threading
from functools import wraps

def rate_limit(*, allowed_per_second: int):
    max_period = 1.0 / allowed_per_second
    last_call = [time.perf_counter()]
    lock = threading.Lock()

    def decorate(func):
        @wraps(func)
        def limit(*args, **kwargs):
            with lock:
                elapsed = time.perf_counter() - last_call[0]
                hold = max_period - elapsed
                if hold > 0:
                    time.sleep(hold)
                result = func(*args, **kwargs)
                last_call[0] = time.perf_counter()
                return result
        return limit
    return decorate

@rate_limit(allowed_per_second=1)  # Adjust the rate as needed
def query_summary_index(query):
    # Your existing query logic here
    return summary_index.query(query)

# Example usage
response = query_summary_index("give me the summary of the xyz")
print(str(response))
This decorator ensures that the query_summary_index function does not exceed the specified rate limit. Adjust the allowed_per_second parameter to control the rate of queries.
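Values below 1 also work if you want fewer than one call per second (Python does not enforce the int annotation at runtime). For example, a minimal sketch reusing the decorator and summary_index from above:

@rate_limit(allowed_per_second=0.2)  # max_period = 1.0 / 0.2, i.e. at most one call every 5 seconds
def slow_query_summary_index(query):
    return summary_index.query(query)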
Additionally, you can consider implementing a caching mechanism to store the results of frequently queried summaries and reuse them instead of querying the summary_index every time. This can further reduce the number of queries and help avoid hitting the rate limit.
from functools import lru_cache

@lru_cache(maxsize=100)
def cached_query_summary_index(query):
    return query_summary_index(query)

# Example usage
response = cached_query_summary_index("give me the summary of the xyz")
print(str(response))
By combining rate limiting and caching, you can effectively manage the rate of queries to the summary_index and avoid hitting the rate limit [1][2].
To continue talking to Dosu, mention @dosu.
Thanks! Will this work as well for:

query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
    ],
)
Yes, you can apply the rate limiting and caching decorators to the RouterQueryEngine as well. You can wrap the query method of the RouterQueryEngine with these decorators to control the rate of queries and cache the results. Here is an example of how you can do this:
import time
import threading
from functools import wraps, lru_cache
from llama_index.core.query_engine import RouterQueryEngine

# Rate limiting decorator
def rate_limit(*, allowed_per_second: int):
    max_period = 1.0 / allowed_per_second
    last_call = [time.perf_counter()]
    lock = threading.Lock()

    def decorate(func):
        @wraps(func)
        def limit(*args, **kwargs):
            with lock:
                elapsed = time.perf_counter() - last_call[0]
                hold = max_period - elapsed
                if hold > 0:
                    time.sleep(hold)
                result = func(*args, **kwargs)
                last_call[0] = time.perf_counter()
                return result
        return limit
    return decorate

# Caching decorator
@lru_cache(maxsize=100)
def cached_query(query_engine, query):
    return query_engine.query(query)

# Apply decorators to the RouterQueryEngine's query method
@rate_limit(allowed_per_second=1)  # Adjust the rate as needed
def rate_limited_query(query_engine, query):
    return cached_query(query_engine, query)

# Example usage
query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
    ],
)
response = rate_limited_query(query_engine, "give me the summary of the xyz")
print(str(response))
In this example:
- The rate_limit decorator ensures that the query method does not exceed the specified rate limit.
- The lru_cache decorator caches the results of the queries to avoid redundant computations.
By combining these decorators, you can effectively manage the rate of queries to the RouterQueryEngine and avoid hitting the rate limit.
To continue talking to Dosu, mention @dosu.
The problem is also that once it starts, it will not stop; I cannot stop the code from sending many requests to the Azure OpenAI server. Can I also limit the number of requests when using the summary index?
To stop your code from sending too many requests to the Azure OpenAI server when using summary_index and implement a limit on the number of requests, you can introduce a delay between requests to ensure you stay within the rate limits. Additionally, you can use the GPTSimpleVectorIndex.from_documents() method to manage the rate of embedding creation. Here is an example of how you can do this in Python:
import time
import openai

# Function to create embeddings with rate limiting
def create_embeddings_with_rate_limit(documents, rate_limit_per_minute):
    interval = 60 / rate_limit_per_minute  # Calculate the interval between requests
    embeddings = []
    for doc in documents:
        try:
            # Create embedding for the document
            embedding = openai.Embedding.create(input=doc, model="text-embedding-ada-002")
            embeddings.append(embedding)
            # Wait for the interval to respect the rate limit
            time.sleep(interval)
        except openai.error.RateLimitError as e:
            print(f"Rate limit error: {e}")
            # Optionally, you can implement a backoff strategy here
            time.sleep(interval)
    return embeddings

# Example usage
documents = ["Document 1", "Document 2", "Document 3"]
rate_limit_per_minute = 60  # Set your rate limit here
embeddings = create_embeddings_with_rate_limit(documents, rate_limit_per_minute)
This code ensures that you do not exceed the rate limit by introducing a delay between each request. Adjust the rate_limit_per_minute variable to match your specific rate limit.
Additionally, if you are using GPTSimpleVectorIndex.from_documents() and encountering rate limit issues, you can manage the rate of requests by implementing a similar delay mechanism or by batching your requests to stay within the allowed limits. If you continue to face issues, consider reaching out to OpenAI support or upgrading your pricing tier to increase your rate limits [1][2].
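As a rough illustration of the batching idea (a sketch only, assuming the same legacy openai client as above and that the embeddings endpoint accepts a list of inputs per call), you could send several documents per request so that fewer requests are needed per minute:

import time
import openai

def create_embeddings_in_batches(documents, batch_size=10, seconds_between_batches=1.0):
    # Group documents so each request covers several of them.
    embeddings = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        response = openai.Embedding.create(input=batch, model="text-embedding-ada-002")
        embeddings.extend(response["data"])
        # Pause between batches to stay under the requests-per-minute limit.
        time.sleep(seconds_between_batches)
    return embeddings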
To continue talking to Dosu, mention @dosu.
I am still getting the same error despite setting it to 0.1:
@rate_limit(allowed_per_second=0.1)  # Adjust the rate as needed
def rate_limited_query(query_engine, query):
    return cached_query(query_engine, query)
HTTP Request: POST https://chat-ai.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-07-01-preview "HTTP/1.1 200 OK"
INFO:llama_index.core.query_engine.router_query_engine:Selecting query engine 0: The question is asking for a summary of the upgrade process on 8000 router, which is related to the documentation of 8000..
Selecting query engine 0: The question is asking for a summary of the upgrade process on 8000, which is related to the documentation of 8000..
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
To implement a mechanism to limit the number of requests when using summary_index and ensure that the code stops sending requests once the limit is reached, you can introduce a counter to track the number of requests and a limit variable to define the maximum number of requests allowed. Here is an example of how you can modify the existing code to include this mechanism:
import time
import threading
from functools import wraps
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector

# Rate limiting decorator
def rate_limit(*, allowed_per_second: float):
    max_period = 1.0 / allowed_per_second
    last_call = [time.perf_counter()]
    lock = threading.Lock()

    def decorate(func):
        @wraps(func)
        def limit(*args, **kwargs):
            with lock:
                elapsed = time.perf_counter() - last_call[0]
                hold = max_period - elapsed
                if hold > 0:
                    time.sleep(hold)
                result = func(*args, **kwargs)
                last_call[0] = time.perf_counter()
                return result
        return limit
    return decorate

class LimitedRequestRouterQueryEngine:
    def __init__(self, query_engine, request_limit):
        self.query_engine = query_engine
        self.request_limit = request_limit
        self.request_count = 0

    @rate_limit(allowed_per_second=0.1)  # Adjust the rate as needed
    def query(self, query):
        if self.request_count >= self.request_limit:
            print("Request limit reached. No more requests will be sent.")
            return None
        self.request_count += 1
        return self.query_engine.query(query)

# Example usage (list_tool and vector_tool are assumed to be defined as in your code)
query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
    ],
)
limited_query_engine = LimitedRequestRouterQueryEngine(query_engine, request_limit=5)
response = limited_query_engine.query("give me the summary of the xyz")
print(str(response))
In this example:
- The rate_limit decorator ensures that the query method does not exceed the specified rate limit.
- The LimitedRequestRouterQueryEngine class wraps the RouterQueryEngine and adds a request counter to limit the number of requests.
- The query method checks if the request limit has been reached before making a query.
By combining these mechanisms, you can effectively manage the rate and number of queries to the RouterQueryEngine and avoid hitting the rate limit [1].
To continue talking to Dosu, mention @dosu.
Also getting this response from OpenAI:
Error code: 400 - {'error': {'message': "This model's maximum context length is 4096 tokens. However, you requested 4172 tokens (3660 in the messages, 512 in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}, '
To handle the "context length exceeded" error from OpenAI, you need to reduce the length of the messages or the completion. Here are some steps you can take:
max_tokens
: Limit the number of tokens generated in each response.max_retries
parameter to handle transient errors by retrying the request.Here is an example of how you can configure these settings in the OpenAI
class:
class OpenAI(LLM):
    model: str = Field(
        default=DEFAULT_OPENAI_MODEL, description="The OpenAI model to use."
    )
    temperature: float = Field(
        default=DEFAULT_TEMPERATURE,
        description="The temperature to use during generation.",
        gte=0.0,
        lte=1.0,
    )
    max_tokens: Optional[int] = Field(
        description="The maximum number of tokens to generate.",
        gt=0,
    )
    additional_kwargs: Dict[str, Any] = Field(
        default_factory=dict, description="Additional kwargs for the OpenAI API."
    )
    max_retries: int = Field(
        default=3,
        description="The maximum number of API retries.",
        gte=0,
    )
    timeout: float = Field(
        default=60.0,
        description="The timeout, in seconds, for API requests.",
        gte=0,
    )
    default_headers: Dict[str, str] = Field(
        default=None, description="The default headers for API requests."
    )
    reuse_client: bool = Field(
        default=True,
        description=(
            "Reuse the OpenAI client between requests. When doing anything with large "
            "volumes of async API calls, setting this to false can improve stability."
        ),
    )
    api_key: str = Field(default=None, description="The OpenAI API key.", exclude=True)
    api_base: str = Field(description="The base URL for OpenAI API.")
    api_version: str = Field(description="The API version for OpenAI API.")

    _client: Optional[SyncOpenAI] = PrivateAttr()
    _aclient: Optional[AsyncOpenAI] = PrivateAttr()
    _http_client: Optional[httpx.Client] = PrivateAttr()

    def __init__(
        self,
        model: str = DEFAULT_OPENAI_MODEL,
        temperature: float = DEFAULT_TEMPERATURE,
        max_tokens: Optional[int] = None,
        additional_kwargs: Optional[Dict[str, Any]] = None,
        max_retries: int = 3,
        timeout: float = 60.0,
        reuse_client: bool = True,
        api_key: Optional[str] = None,
        api_base: Optional[str] = None,
        api_version: Optional[str] = None,
        callback_manager: Optional[CallbackManager] = None,
        default_headers: Optional[Dict[str, str]] = None,
        http_client: Optional[httpx.Client] = None,
        # base class
        system_prompt: Optional[str] = None,
        messages_to_prompt: Optional[Callable[[Sequence[ChatMessage]], str]] = None,
        completion_to_prompt: Optional[Callable[[str], str]] = None,
        pydantic_program_mode: PydanticProgramMode = PydanticProgramMode.DEFAULT,
        output_parser: Optional[BaseOutputParser] = None,
        **kwargs: Any,
    ) -> None:
        additional_kwargs = additional_kwargs or {}

        api_key, api_base, api_version = resolve_openai_credentials(
            api_key=api_key,
            api_base=api_base,
            api_version=api_version,
        )

        super().__init__(
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            additional_kwargs=additional_kwargs,
            max_retries=max_retries,
            callback_manager=callback_manager,
            api_key=api_key,
            api_version=api_version,
            api_base=api_base,
            timeout=timeout,
            reuse_client=reuse_client,
            default_headers=default_headers,
            system_prompt=system_prompt,
            messages_to_prompt=messages_to_prompt,
            completion_to_prompt=completion_to_prompt,
            pydantic_program_mode=pydantic_program_mode,
            output_parser=output_parser,
            **kwargs,
        )

        self._client = None
        self._aclient = None
        self._http_client = http_client

    def _get_client(self) -> SyncOpenAI:
        if not self.reuse_client:
            return SyncOpenAI(**self._get_credential_kwargs())

        if self._client is None:
            self._client = SyncOpenAI(**self._get_credential_kwargs())
        return self._client

    def _get_aclient(self) -> AsyncOpenAI:
        if not self.reuse_client:
            return AsyncOpenAI(**self._get_credential_kwargs())

        if self._aclient is None:
            self._aclient = AsyncOpenAI(**self._get_credential_kwargs())
        return self._aclient

    def _get_model_name(self) -> str:
        model_name = self.model
        if "ft-" in model_name:  # legacy fine-tuning
            model_name = model_name.split(":")[0]
        elif model_name.startswith("ft:"):
            model_name = model_name.split(":")[1]
        return model_name

    def _is_azure_client(self) -> bool:
        return isinstance(self._get_client(), AzureOpenAI)

    @classmethod
    def class_name(cls) -> str:
        return "openai_llm"

    @property
    def _tokenizer(self) -> Optional[Tokenizer]:
        """
        Get a tokenizer for this model, or None if a tokenizing method is unknown.

        OpenAI can do this using the tiktoken package, subclasses may not have
        this convenience.
        """
        return tiktoken.encoding_for_model(self._get_model_name())

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=openai_modelname_to_contextsize(self._get_model_name()),
            num_output=self.max_tokens or -1,
            is_chat_model=is_chat_model(model=self._get_model_name()),
            is_function_calling_model=is_function_calling_model(
                model=self._get_model_name()
            ),
            model_name=self.model,
        )
By setting max_tokens, you can control the length of the generated responses. Additionally, you can handle retries and timeouts to manage the rate of queries effectively. Implementing a rate-limiting mechanism will help you stay within the allowed number of requests per minute [1].
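For instance, a minimal sketch of capping the completion size through llama_index settings (assuming the OpenAI wrapper shown above; if you are on Azure, the AzureOpenAI class from llama_index.llms.azure_openai accepts the same max_tokens, max_retries, and timeout parameters plus your deployment details):

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Cap completions at 256 tokens and retry transient failures a few times,
# so prompt plus completion is less likely to exceed the 4096-token window.
Settings.llm = OpenAI(model="gpt-3.5-turbo", max_tokens=256, max_retries=3, timeout=60.0)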
To continue talking to Dosu, mention @dosu.
What does this log say?
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.70it/s]
I still think it is sending requests too fast.
The log message Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.70it/s]
indicates that a batch process has completed successfully, processing one batch at a rate of 3.70 iterations per second. This suggests that the process is running quite fast.
To ensure that your code sends requests at a slower rate and handles the "context length exceeded" error effectively, you can implement rate limiting and context length checks:
- Rate limiting: use the time.sleep function to introduce delays between requests.
- Context length handling: truncate any input that exceeds the maximum allowed length before sending it.
Here is a sample code snippet:
import time

# Function to send a request
def send_request(data):
    # Simulate sending a request
    print(f"Sending request with data: {data}")

# Function to handle context length
def handle_context_length(context, max_length):
    if len(context) > max_length:
        print(f"Context length exceeded: {len(context)} > {max_length}")
        context = context[:max_length]
    return context

# Example usage
max_context_length = 4096
requests = ["short request", "a" * 5000, "another short request"]

for request in requests:
    # Handle context length
    request = handle_context_length(request, max_context_length)
    # Send the request
    send_request(request)
    # Rate limiting: wait for 1 second before sending the next request
    time.sleep(1)
In this example:
- The handle_context_length function checks if the context length exceeds the maximum allowed length and truncates it if necessary.
- The time.sleep(1) call introduces a 1-second delay between requests to ensure they are sent at a slower rate. Adjust the sleep duration as needed.
Additionally, if you are using LlamaIndex, note that it does not have built-in protection for rate limits, but it does have some retry mechanisms for unstable connections. This might be something to consider adding, especially for services like OpenAI [1][2].
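One caveat: the 4096 limit in the error message is counted in tokens, not characters, so the character-based truncation above is only approximate. A token-aware sketch (assuming the tiktoken package and the gpt-3.5-turbo encoding; adjust for your model) could look like this:

import tiktoken

def truncate_to_tokens(text, max_tokens, model="gpt-3.5-turbo"):
    # Encode with the model's tokenizer and keep only the first max_tokens tokens.
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])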
To continue talking to Dosu, mention @dosu.
Can you check the whole code and tell me where this should be inserted:
import time
import threading
from functools import wraps

def rate_limit(*, allowed_per_second: int):
    max_period = 1.0 / allowed_per_second
    last_call = [time.perf_counter()]
    lock = threading.Lock()

    def decorate(func):
        @wraps(func)
        def limit(*args, **kwargs):
            with lock:
                elapsed = time.perf_counter() - last_call[0]
                hold = max_period - elapsed
                if hold > 0:
                    time.sleep(hold)
                result = func(*args, **kwargs)
                last_call[0] = time.perf_counter()
                return result
        return limit
    return decorate

class LimitedRequestRouterQueryEngine:
    def __init__(self, query_engine, request_limit):
        self.query_engine = query_engine
        self.request_limit = request_limit
        self.request_count = 0

    @rate_limit(allowed_per_second=0.0000001)  # Adjust the rate as needed
    def query(self, query):
        if self.request_count >= self.request_limit:
            print("Request limit reached. No more requests will be sent.")
            return None
        self.request_count += 1
        return self.query_engine.query(query)

limited_query_engine = LimitedRequestRouterQueryEngine(query_engine, request_limit=3)
response = rate_limited_query(query_engine, "summarize the xyz configuration steps")
To integrate the rate limiting and context length handling into your existing code, you need to ensure that the LimitedRequestRouterQueryEngine class is correctly instantiated and used. Here is the complete code with the necessary modifications:
import time
import threading
from functools import wraps

# Rate limiting decorator
def rate_limit(*, allowed_per_second: float):
    max_period = 1.0 / allowed_per_second
    last_call = [time.perf_counter()]
    lock = threading.Lock()

    def decorate(func):
        @wraps(func)
        def limit(*args, **kwargs):
            with lock:
                elapsed = time.perf_counter() - last_call[0]
                hold = max_period - elapsed
                if hold > 0:
                    time.sleep(hold)
                result = func(*args, **kwargs)
                last_call[0] = time.perf_counter()
                return result
        return limit
    return decorate

# Function to handle context length
def handle_context_length(context, max_length):
    if len(context) > max_length:
        print(f"Context length exceeded: {len(context)} > {max_length}")
        context = context[:max_length]
    return context

class LimitedRequestRouterQueryEngine:
    def __init__(self, query_engine, request_limit, max_context_length):
        self.query_engine = query_engine
        self.request_limit = request_limit
        self.request_count = 0
        self.max_context_length = max_context_length

    @rate_limit(allowed_per_second=0.1)  # Adjust the rate as needed
    def query(self, query):
        if self.request_count >= self.request_limit:
            print("Request limit reached. No more requests will be sent.")
            return None
        self.request_count += 1
        query = handle_context_length(query, self.max_context_length)
        return self.query_engine.query(query)

# Example usage
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector
from llama_index.core.tools import QueryEngineTool

# Define your query engines (nodes and storage_context are assumed to be defined earlier)
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

list_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()

list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description="Useful for summarization questions related to 8k on its documentations",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific context from 8k on its documentations.",
)

query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[list_tool, vector_tool],
)

# Instantiate the LimitedRequestRouterQueryEngine with a request limit and max context length
limited_query_engine = LimitedRequestRouterQueryEngine(query_engine, request_limit=3, max_context_length=4096)

# Use the limited query engine
response = limited_query_engine.query("summarize the xyz configuration steps")
print(str(response))
Key points:
- The rate_limit decorator ensures that the query method does not exceed the specified rate limit.
- The handle_context_length function checks if the context length exceeds the maximum allowed length and truncates it if necessary.
- The LimitedRequestRouterQueryEngine class wraps the RouterQueryEngine and adds a request counter to limit the number of requests.
By combining these mechanisms, you can effectively manage the rate and number of queries to the RouterQueryEngine and avoid hitting the rate limit [1][2].
To continue talking to Dosu, mention @dosu.
I think we should implement something like this:
from openai.error import RateLimitError
import backoff

@backoff.on_exception(backoff.expo, RateLimitError)
def completions_with_backoff(**kwargs):
    response = openai.Completion.create(**kwargs)
    return response
All of the code before did not help.
To implement a backoff strategy using the backoff library for handling rate limit errors in your existing code, you can modify the LimitedRequestRouterQueryEngine class to include the backoff mechanism. Here is the updated code:
import time
import threading
from functools import wraps

import backoff
from openai.error import RateLimitError

# Rate limiting decorator
def rate_limit(*, allowed_per_second: float):
    max_period = 1.0 / allowed_per_second
    last_call = [time.perf_counter()]
    lock = threading.Lock()

    def decorate(func):
        @wraps(func)
        def limit(*args, **kwargs):
            with lock:
                elapsed = time.perf_counter() - last_call[0]
                hold = max_period - elapsed
                if hold > 0:
                    time.sleep(hold)
                result = func(*args, **kwargs)
                last_call[0] = time.perf_counter()
                return result
        return limit
    return decorate

# Function to handle context length
def handle_context_length(context, max_length):
    if len(context) > max_length:
        print(f"Context length exceeded: {len(context)} > {max_length}")
        context = context[:max_length]
    return context

class LimitedRequestRouterQueryEngine:
    def __init__(self, query_engine, request_limit, max_context_length):
        self.query_engine = query_engine
        self.request_limit = request_limit
        self.request_count = 0
        self.max_context_length = max_context_length

    @rate_limit(allowed_per_second=0.1)  # Adjust the rate as needed
    @backoff.on_exception(backoff.expo, RateLimitError, max_tries=5)
    def query(self, query):
        if self.request_count >= self.request_limit:
            print("Request limit reached. No more requests will be sent.")
            return None
        self.request_count += 1
        query = handle_context_length(query, self.max_context_length)
        return self.query_engine.query(query)

# Example usage
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector
from llama_index.core.tools import QueryEngineTool

# Define your query engines (nodes and storage_context are assumed to be defined earlier)
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

list_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()

list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description="Useful for summarization questions related to 8k on its documentations",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific context from 8k on its documentations.",
)

query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[list_tool, vector_tool],
)

# Instantiate the LimitedRequestRouterQueryEngine with a request limit and max context length
limited_query_engine = LimitedRequestRouterQueryEngine(query_engine, request_limit=3, max_context_length=4096)

# Use the limited query engine
response = limited_query_engine.query("summarize the xyz configuration steps")
print(str(response))
In this example:
- The rate_limit decorator ensures that the query method does not exceed the specified rate limit.
- The handle_context_length function checks if the context length exceeds the maximum allowed length and truncates it if necessary.
- The LimitedRequestRouterQueryEngine class wraps the RouterQueryEngine and adds a request counter to limit the number of requests.
- The backoff.on_exception decorator retries the request with exponential backoff when a RateLimitError is encountered.
By combining these mechanisms, you can effectively manage the rate and number of queries to the RouterQueryEngine and handle rate limit errors gracefully [1][2].
To continue talking to Dosu, mention @dosu.
What if I am using a different embedding model?
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.embed_model = embed_model
When using different embedding models like HuggingFace embeddings, you can still implement rate limiting and backoff strategies to manage the rate of requests and handle rate limit errors. Here is how you can integrate these mechanisms into your existing code:
import time
import threading
from functools import wraps

import backoff
from requests.exceptions import HTTPError

# Rate limiting decorator
def rate_limit(*, allowed_per_second: float):
    max_period = 1.0 / allowed_per_second
    last_call = [time.perf_counter()]
    lock = threading.Lock()

    def decorate(func):
        @wraps(func)
        def limit(*args, **kwargs):
            with lock:
                elapsed = time.perf_counter() - last_call[0]
                hold = max_period - elapsed
                if hold > 0:
                    time.sleep(hold)
                result = func(*args, **kwargs)
                last_call[0] = time.perf_counter()
                return result
        return limit
    return decorate

# Function to handle context length
def handle_context_length(context, max_length):
    if len(context) > max_length:
        print(f"Context length exceeded: {len(context)} > {max_length}")
        context = context[:max_length]
    return context

class LimitedRequestRouterQueryEngine:
    def __init__(self, query_engine, request_limit, max_context_length):
        self.query_engine = query_engine
        self.request_limit = request_limit
        self.request_count = 0
        self.max_context_length = max_context_length

    @rate_limit(allowed_per_second=0.1)  # Adjust the rate as needed
    @backoff.on_exception(backoff.expo, HTTPError, max_tries=5)
    def query(self, query):
        if self.request_count >= self.request_limit:
            print("Request limit reached. No more requests will be sent.")
            return None
        self.request_count += 1
        query = handle_context_length(query, self.max_context_length)
        return self.query_engine.query(query)

# Example usage with HuggingFace embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector
from llama_index.core.tools import QueryEngineTool

# Use HuggingFace embeddings
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.embed_model = embed_model

# Define your query engines (nodes and storage_context are assumed to be defined earlier)
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

list_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()

list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description="Useful for summarization questions related to 8k on its documentations",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific context from 8k on its documentations.",
)

query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[list_tool, vector_tool],
)

# Instantiate the LimitedRequestRouterQueryEngine with a request limit and max context length
limited_query_engine = LimitedRequestRouterQueryEngine(query_engine, request_limit=3, max_context_length=4096)

# Use the limited query engine
response = limited_query_engine.query("summarize the xyz configuration steps")
print(str(response))
In this example:
- The rate_limit decorator ensures that the query method does not exceed the specified rate limit.
- The handle_context_length function checks if the context length exceeds the maximum allowed length and truncates it if necessary.
- The LimitedRequestRouterQueryEngine class wraps the RouterQueryEngine and adds a request counter to limit the number of requests.
- The backoff.on_exception decorator retries the request with exponential backoff when a RateLimitError or HTTPError is encountered.
By combining these mechanisms, you can effectively manage the rate and number of queries to the RouterQueryEngine and handle rate limit errors gracefully, even when using different embedding models like HuggingFace embeddings [1][2].
To continue talking to Dosu, mention @dosu.
Question Validation
Question
hi,
I am building code for a router agent to route between retrieval and summarization for some docs. Here is my code:
from llama_index.core import SummaryIndex
from llama_index.core import VectorStoreIndex

summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

list_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()

from llama_index.core.tools import QueryEngineTool

list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description=(
        "Useful for summarization questions related to 8k on"
        " its documentations"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from 8k on"
        "its documentations."
    ),
)

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector, LLMMultiSelector
from llama_index.core.selectors import (
    PydanticMultiSelector,
    PydanticSingleSelector,
)

query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
    ],
)

response = query_engine.query("give me the summary of the xyz")
print(str(response))
INFO:llama_index.core.query_engine.router_query_engine:Selecting query engine 1: The summary of the upgrade process is a specific context that can be retrieved from the 8000 documentation..
Selecting query engine 1: The summary of the upgrade process is a specific context that can be retrieved from the 8000 documentation..
Since Azure OpenAI is rate limited to a certain rate, once the query starts it is too fast for Azure OpenAI to handle:
INFO:openai._base_client:Retrying request to /deployments/gpt-35-turbo/chat/completions in 0.904817 seconds
Retrying request to /deployments/gpt-35-turbo/chat/completions in 0.904817 seconds
INFO:httpx:HTTP Request: POST https://chat-ai.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-07-01-preview "HTTP/1.1 429 Too Many Requests"
HTTP Request: POST https://chat-ai.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-07-01-preview "HTTP/1.1 429 Too Many Requests"
INFO:httpx:HTTP Request: POST https://chat-ai.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-07-01-preview "HTTP/1.1 429 Too Many Requests"
Can you help me reduce the rate of the summary_index queries?
Best