run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Token Rate limit setting for non-OpenAI models #10796

Closed: ohdearquant closed this issue 3 months ago

ohdearquant commented 7 months ago

Feature Description

Currently the system doesn't have built-in support for controlling rate limits for non-OpenAI models, which can be a problem for users (especially in production) who want to run workflows over very large amounts of input data.

for example, in lionagi, we can do the following

service = Services.OpenAI(token_limit=1_000, request_limit=1_000, interval=60)

session1 = Session(service=service)
session2 = Session(service=service)
session3 = Session(service=service)

The rate limit is then applied across all three sessions, because they share the same service object. I am integrating LlamaIndex into lionagi as a dependency, and this feature is essential for that integration.
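What makes this work is that every session holds a reference to the same service object, so one shared budget governs all of them. A minimal sketch of that sharing pattern in plain Python (hypothetical names, not lionagi's actual implementation):

import asyncio

class Service:
    """Holds one shared permit pool for all sessions built on it."""

    def __init__(self, request_limit: int):
        self.semaphore = asyncio.Semaphore(request_limit)

class Session:
    def __init__(self, service: Service):
        self.service = service  # a reference, not a copy

    async def call(self, coro):
        # Every session contends for the same permits.
        async with self.service.semaphore:
            return await coro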

Reason

I think the demand is not too high yet

Value of Feature

This feature is invaluable for business users. The ability to control rate limits is crucial.

dosubot[bot] commented 7 months ago

🤖

Hey @ohdearquant, great to cross paths with you again! Hope all has been well on your end.👋

Thank you for your feature request. I understand that you're looking for a way to control the rate limit for alternative models in LlamaIndex, similar to the functionality in lionagi.

Currently, LlamaIndex does not have built-in support for rate limiting. However, it's possible to implement this yourself using Python's asyncio library, which LlamaIndex already uses. Specifically, the asyncio.Semaphore class can limit the number of coroutines running simultaneously, which bounds concurrency and, indirectly, the request rate.

Here's a simplified example of how this could be implemented:

import asyncio

class RateLimiter:
    """Caps how many jobs may run concurrently."""

    def __init__(self, rate_limit: int):
        # The semaphore starts with `rate_limit` permits; each
        # running job holds one until it finishes.
        self.semaphore = asyncio.Semaphore(rate_limit)

    async def execute(self, job):
        # Wait for a free permit, then await the job (a coroutine).
        async with self.semaphore:
            return await job

In this example, RateLimiter takes a rate_limit argument and uses it to create a semaphore. The execute method is a coroutine that waits to acquire the semaphore before awaiting the job, so at most rate_limit jobs run at once. Strictly speaking, this caps concurrency rather than enforcing a per-interval rate, but it is often sufficient to stay under a provider's limits.
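As a usage sketch (fake_llm_call is a hypothetical stand-in for a real API call, not part of LlamaIndex):

import asyncio

async def fake_llm_call(i: int) -> int:
    await asyncio.sleep(0.1)  # pretend this is a network request
    return i

async def main():
    limiter = RateLimiter(rate_limit=10)  # at most 10 jobs in flight
    results = await asyncio.gather(
        *(limiter.execute(fake_llm_call(i)) for i in range(100))
    )
    print(len(results))  # 100

asyncio.run(main())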

This is a simplified example and the actual implementation would need to be integrated into the existing LlamaIndex codebase. However, it should give you a good starting point for how to implement this feature.
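One caveat: a semaphore bounds how many jobs are in flight at once, whereas the lionagi API above expresses a budget of requests and tokens that replenishes every interval seconds. A closer, still simplified sketch of those semantics could be a fixed-window limiter along these lines (IntervalRateLimiter is a hypothetical name, not existing LlamaIndex code):

import asyncio
import time

class IntervalRateLimiter:
    """Allow at most request_limit requests and token_limit tokens
    per interval seconds, resetting the budget each window."""

    def __init__(self, request_limit: int, token_limit: int, interval: float):
        self.request_limit = request_limit
        self.token_limit = token_limit
        self.interval = interval
        self._window_start = time.monotonic()
        self._requests = 0
        self._tokens = 0
        self._lock = asyncio.Lock()

    async def acquire(self, tokens: int) -> None:
        # Note: a single request larger than token_limit would
        # wait forever in this sketch.
        while True:
            async with self._lock:
                now = time.monotonic()
                if now - self._window_start >= self.interval:
                    # A new window has started: reset the budget.
                    self._window_start = now
                    self._requests = 0
                    self._tokens = 0
                if (self._requests < self.request_limit
                        and self._tokens + tokens <= self.token_limit):
                    self._requests += 1
                    self._tokens += tokens
                    return
                # Budget exhausted: sleep until the window rolls over.
                sleep_for = self._window_start + self.interval - now
            await asyncio.sleep(max(sleep_for, 0.01))

A single instance of such a limiter, shared across sessions as in the lionagi example above, would give the cross-session behavior requested here.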

Please note that this is a significant change to the LlamaIndex codebase and would need to be thoroughly tested to ensure it doesn't introduce any regressions or performance issues. It would also need to be documented so that users are aware of the new feature and how to use it.

I hope this helps! If you have any further questions or need more clarification, please don't hesitate to ask.

For reference:

- The official Python documentation for asyncio.Semaphore: https://docs.python.org/3/library/asyncio-sync.html#asyncio.Semaphore
- The code in the LlamaIndex repository that already uses asyncio.Semaphore
- The code in the LlamaIndex repository that could potentially be modified to include rate limiting


#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.