redhat-et / foundation-models-for-documentation

Improve ROSA customer experience (and customer retention) by leveraging foundation models to do “gpt-chat” style search of Red Hat customer documentation assets.

Optimizing LLMs for max performance when serving on ODH #48

Open codificat opened 1 year ago

codificat commented 1 year ago

What are the resource requirements of the deployed model? Explain the resources defined for the model pod.

What is the throughput of the model? How can we increase the throughput?

Given a combination of hardware, model type, and optimization techniques, what maximum throughput can we expect and observe?
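For reference on the resources question: on ODH the model pod's resources are typically expressed as Kubernetes requests/limits on the serving container. The snippet below is a minimal, illustrative sketch of a KServe-style InferenceService spec; the name, model format, and the CPU/memory/GPU values are assumptions, not the actual deployment settings.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: docs-llm                  # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch             # illustrative; depends on the served model
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"     # GPU request and limit must be equal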

ishaan-jaff commented 7 months ago

@codificat Hi, I'm the maintainer of LiteLLM. We let you maximize throughput by load balancing across multiple LLM endpoints. I thought it might be useful for you; I'd love feedback if not.

Here's the quick start for the LiteLLM load balancer (works with 100+ LLMs). Docs: https://docs.litellm.ai/docs/simple_proxy#model-alias

Step 1: Create a config.yaml

model_list:
- model_name: openhermes
  litellm_params:
      model: openhermes
      temperature: 0.6
      max_tokens: 400
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8000/v1
- model_name: openhermes
  litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8001/v1
- model_name: openhermes
  litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      frequency_penalty: 0.6
      api_base: http://192.168.1.23:8010/v1
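Note that all three entries above share the same model_name (openhermes): the proxy treats them as one group and load balances requests across the three api_base endpoints, which is how it raises overall throughput.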

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "openhermes",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
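Since the proxy exposes an OpenAI-compatible /chat/completions endpoint, existing OpenAI-style clients should also work by pointing their base URL at the proxy instead of at the individual model endpoints.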