wagtail / wagtail-ai

Get help with your Wagtail content using AI superpowers.
https://wagtail-ai.readthedocs.io/latest
MIT License

Support rate limiting #37

Open Morsey187 opened 11 months ago

Morsey187 commented 11 months ago

Add support for raising custom Wagtail AI rate limit exceptions.

I'm not aware of any existing support for rate limiting within Wagtail, and I'm unsure which library would be preferable here, so I can't suggest an approach. However, I'd imagine we'd want to support limiting not only requests but also tokens per user account. This would allow developers to configure the package so that individual editors' activity doesn't affect one another, e.g. editor 1 reaching the usage limit for the whole organisation account and thereby preventing editor 2 from using the AI tools.
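
For illustration only, a per-editor request budget could be layered on top of Django's cache framework before the backend is ever called; the limit value and helper below are hypothetical, not existing wagtail-ai settings:

from django.core.cache import cache

REQUESTS_PER_HOUR = 100  # hypothetical per-editor budget, would be configurable

def editor_within_rate_limit(user_id: int) -> bool:
    """Return True if this editor is still under their hourly request budget."""
    key = f"wagtail-ai:requests:{user_id}"
    # add() only creates the key if it doesn't already exist, so the one-hour
    # window starts at the editor's first request.
    cache.add(key, 0, timeout=60 * 60)
    try:
        count = cache.incr(key)
    except ValueError:
        # The key expired between add() and incr(); start a fresh window.
        cache.set(key, 1, timeout=60 * 60)
        count = 1
    return count <= REQUESTS_PER_HOUR

A token budget could be tracked the same way by incrementing the counter with whatever token usage the backend reports for each response.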

tm-kn commented 11 months ago

We'd need to investigate whether we can catch those exceptions in the AI backend implementation.

It looks like those would need to be implemented in https://github.com/simonw/llm directly and then we could catch the "llm" package's exceptions, or if there's an HTTP response returned, we could use the status code to figure that out.
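
For the status-code route, a rough sketch could look like this, assuming the underlying client raises httpx-style errors that carry the response (whether it does depends entirely on which provider library the llm plugin wraps):

import httpx

HTTP_TOO_MANY_REQUESTS = 429

def looks_like_rate_limit(exc: Exception) -> bool:
    # httpx.HTTPStatusError keeps the original response, so the status code can
    # be checked without importing any provider-specific exception class.
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code == HTTP_TOO_MANY_REQUESTS
    # Fall back to duck typing for clients that attach a status_code attribute.
    return getattr(exc, "status_code", None) == HTTP_TOO_MANY_REQUESTS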

We can't guarantee that our local environment will have all the optional dependencies installed.

Another way might be something like this, which is still not ideal but a good trade-off if the user experience matters:

from collections.abc import Generator


def get_rate_limiting_exceptions() -> Generator[type[Exception], None, None]:
    # Only yield exception classes for provider packages that are actually
    # installed; none of them are required dependencies of wagtail-ai.
    try:
        import openai
    except ImportError:
        pass
    else:
        yield openai.RateLimitError

    try:
        import another_package  # placeholder for any other provider SDK
    except ImportError:
        pass
    else:
        yield another_package.RateLimitException


def handle(prompt, context):
    try:
        return backend.prompt_with_context(prompt, context)
    except Exception as e:
        # Re-raise provider-specific rate limit errors as a single exception
        # (WagtailAiRateLimitError would be defined by wagtail-ai itself) that
        # callers can catch without knowing which backend is in use.
        rate_limit_exception_classes = tuple(get_rate_limiting_exceptions())
        if rate_limit_exception_classes and isinstance(e, rate_limit_exception_classes):
            raise WagtailAiRateLimitError from e
        raise
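
If handle() raised a dedicated exception like that, the editor-facing view could translate it into an HTTP 429 so the UI can show a useful message rather than a generic failure. A rough sketch only (not the actual wagtail-ai view), reusing handle() and WagtailAiRateLimitError from the sketch above:

from django.http import JsonResponse

def process_prompt(request):
    try:
        text = handle(request.POST["prompt"], request.POST.get("context", ""))
    except WagtailAiRateLimitError:
        # Surface the provider's rate limit to the editor instead of a 500.
        return JsonResponse(
            {"error": "The AI service is rate limited right now, please retry shortly."},
            status=429,
        )
    return JsonResponse({"message": text})
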
ishaan-jaff commented 11 months ago

@Morsey187 @tm-kn

I'm the maintainer of LiteLLM. We provide an open-source proxy for load balancing Azure, OpenAI, and any LiteLLM-supported LLM, and it can process 500+ requests/second.

From this thread it looks like you're trying to handle rate limits and load balance between OpenAI instances. I hope our solution makes it easier for you. (I'd love feedback if you're trying to do this.)

Here's the quick start:

Doc: https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model

Step 1: Create a config.yaml

model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      api_key: 
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key: 
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key: 
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-4",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
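
The proxy speaks the OpenAI API, so the same request can be made from Python by pointing the official openai client at the proxy's base URL. A sketch assuming openai>=1.0; the API key is a placeholder unless the proxy is configured to require one:

from openai import OpenAI

# Point the standard OpenAI client at the local LiteLLM proxy.
client = OpenAI(api_key="anything", base_url="http://0.0.0.0:8000")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)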