codificat opened this issue 1 year ago
@codificat Hi, I'm the maintainer of LiteLLM. We let you maximize throughput by load balancing across multiple LLM endpoints. I thought this might be useful for you; I'd love feedback if not.
Here's the quick start doc for the LiteLLM load balancer (works with 100+ LLMs): https://docs.litellm.ai/docs/simple_proxy#model-alias
```yaml
model_list:
  - model_name: openhermes
    litellm_params:
      model: openhermes
      temperature: 0.6
      max_tokens: 400
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8000/v1
  - model_name: openhermes
    litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8001/v1
  - model_name: openhermes
    litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      frequency_penalty: 0.6
      api_base: http://192.168.1.23:8010/v1
```
```shell
litellm --config /path/to/config.yaml
```
```shell
curl --location 'http://0.0.0.0:8000/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "openhermes",
    "messages": [
      {
        "role": "user",
        "content": "what llm are you"
      }
    ]
  }'
```
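Since the proxy exposes an OpenAI-compatible endpoint, you can also send the same request from Python with the OpenAI SDK. A minimal sketch, assuming the proxy from the config above is running on `http://0.0.0.0:8000` and no master key is configured (the `api_key` value is just a placeholder):

```python
# Sketch: call the LiteLLM proxy through the OpenAI Python SDK (openai>=1.0).
# "openhermes" is the model_name alias from config.yaml above; the proxy
# spreads requests for that alias across the three api_base endpoints.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="openhermes",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)
```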
What are the resource requirements of the deployed model? Please explain the resources defined for the model pod.
What is the throughput of the model, and how can we increase it?
Given a combination of hardware, model type, and optimization techniques, what is the maximum expected throughput, and what throughput is actually observed?
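For the throughput question, one rough way to get an observed number is to fire concurrent requests at the endpoint and count completion tokens per second. A minimal sketch, assuming the same OpenAI-compatible endpoint as in the examples above; the concurrency level, prompt, and `max_tokens` are arbitrary placeholders, not recommendations:

```python
# Rough throughput probe: send N concurrent requests, report completion tokens/sec.
# Assumes the OpenAI-compatible endpoint from the examples above; the api_key
# value is a placeholder.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://0.0.0.0:8000", api_key="sk-placeholder")
CONCURRENCY = 8


async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="openhermes",
        messages=[{"role": "user", "content": "Write a short paragraph about llamas."}],
        max_tokens=200,
    )
    # Count only generated (completion) tokens toward throughput.
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} completion tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tokens/sec")


if __name__ == "__main__":
    asyncio.run(main())
```

Observed numbers from a probe like this depend heavily on the backend serving stack, batch size, and hardware, so they are only a baseline for comparing configurations, not a hard answer to the expected-maximum question.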