lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

OpenAI Proxy #2693

Open PSU3D0 opened 9 months ago

PSU3D0 commented 9 months ago

Hello all! Was curious if anyone has used FastChat as a proxy around OpenAI/Azure endpoints.

Use case:

We have multiple nodes all making calls to either Azure or OpenAI chat endpoints. We have two main problems.

If anyone has done this before with FastChat, or used something similar, I'd love to hear about it. Cheers!

chymian commented 9 months ago

I think LiteLLM, which is an API middleware/proxy, has rate limiting implemented.
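For reference, a minimal sketch of rate-limit-aware routing with LiteLLM's Router, assuming the per-deployment "tpm"/"rpm" fields and the "usage-based-routing" strategy described in LiteLLM's docs (exact keys and strategy names may differ across versions):

import os

from litellm import Router

# Two deployments share one alias; with usage-based routing the
# router tracks tokens and requests per minute and skips a
# deployment that has hit its tpm/rpm budget.
model_list = [{
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {
        "model": "azure/chatgpt-v-2",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    },
    "tpm": 240000,  # token budget per minute for this deployment
    "rpm": 1800     # request budget per minute for this deployment
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY")
    },
    "tpm": 1000000,
    "rpm": 10000
}]

router = Router(model_list=model_list, routing_strategy="usage-based-routing")

response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}])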

krrishdholakia commented 9 months ago

Hey @PSU3D0, I'm the maintainer of LiteLLM. We let you create a Router to maximize throughput via load balancing + queuing (beta).

I'd love to get your feedback on whether this solves your issue.

Here's the quick start:

import asyncio
import os

from litellm import Router

# Three deployments share the alias "gpt-3.5-turbo"; the router
# load-balances calls across them.
model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {
        "model": "gpt-3.5-turbo",  # plain OpenAI deployment
        "api_key": os.getenv("OPENAI_API_KEY")
    }
}]

router = Router(model_list=model_list)

async def main():
    # openai.ChatCompletion.create replacement
    response = await router.acompletion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey, how's it going?"}])
    print(response)

asyncio.run(main())
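Since all three deployments register under the same "gpt-3.5-turbo" alias, the Router spreads calls across the two Azure deployments and the OpenAI one, which covers the multi-node throughput side of the original question.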