kaito-project / kaito

Kubernetes AI Toolchain Operator

Support more llm runtime #608

Open zhuangqh opened 1 month ago

zhuangqh commented 1 month ago

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Today KAITO supports the popular Hugging Face transformers runtime. We should also support other runtimes such as vLLM.

Describe alternatives you've considered

Additional context

zhuangqh commented 2 weeks ago

Motivation

Today, KAITO uses the Hugging Face transformers runtime to build its inference and tuning services. It offers an out-of-the-box experience for nearly all transformer-based models hosted on Hugging Face.

However, other LLM inference libraries such as vLLM and TensorRT-LLM focus more on inference performance and resource efficiency, and many of our users prefer them as their inference engine.

Goals

Non-Goals

Design Details

Inference server API

- Inference API
- Health check API
- Metrics
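
To make these concrete, here is a minimal client-side sketch, assuming the runtime exposes an OpenAI-compatible inference API plus /health and /metrics endpoints (as vLLM's OpenAI-compatible server does); the address and port are assumptions, not a finalized KAITO contract.

import requests

# Assumed server address; KAITO's service name and port may differ.
BASE_URL = "http://localhost:5000"

# Health check API: vLLM's OpenAI-compatible server answers 200 on /health once the engine is ready.
assert requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200

# Inference API: OpenAI-compatible endpoints; discover the served model name first.
model = requests.get(f"{BASE_URL}/v1/models", timeout=5).json()["data"][0]["id"]
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": model,
        "messages": [{"role": "user", "content": "What is kubernetes?"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])

# Metrics: Prometheus-format text (KV cache usage, request counters, latency histograms).
print(requests.get(f"{BASE_URL}/metrics", timeout=5).text[:500])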

Workspace CRD change

Change the default runtime from huggingface/transformers to vLLM. To preserve compatibility for out-of-tree models, provide an annotation that lets the user fall back to the huggingface/transformers runtime.

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini
  annotations:
    workspace.kaito.io/runtime: "huggingface"
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3
inference:
  preset:
    name: phi-3-mini-4k-instruct
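
For illustration only, a minimal sketch of the intended annotation semantics follows; KAITO's controller is written in Go, so this is not its actual code, and the helper name and error handling are assumptions.

# Illustrative sketch of the annotation semantics described above; not KAITO controller code.
RUNTIME_ANNOTATION = "workspace.kaito.io/runtime"
SUPPORTED_RUNTIMES = {"vllm", "huggingface"}

def select_runtime(annotations: dict) -> str:
    """Default to vLLM; fall back to huggingface/transformers when the annotation asks for it."""
    runtime = annotations.get(RUNTIME_ANNOTATION, "vllm").lower()
    if runtime not in SUPPORTED_RUNTIMES:
        raise ValueError(f"unsupported runtime: {runtime}")
    return runtime

print(select_runtime({"workspace.kaito.io/runtime": "huggingface"}))  # the Workspace above -> huggingface
print(select_runtime({}))                                             # no annotation -> vllm (new default)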

Engine default parameters

Choose better default engine arguments for users.

notes: https://docs.vllm.ai/en/latest/models/engine_args.html
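
As a starting point for that discussion, the sketch below assembles a vLLM server command from a handful of commonly tuned engine arguments; the flag names come from the engine-args documentation linked above, while the model name and the values themselves are illustrative assumptions, not the defaults KAITO will ship.

import shlex

# Candidate defaults to evaluate; values are illustrative, not final.
engine_args = {
    "--gpu-memory-utilization": "0.9",  # fraction of GPU memory reserved for weights + KV cache
    "--swap-space": "4",                # CPU swap space per GPU in GiB (see the OOM notes below)
    "--max-model-len": "4096",          # cap context length to bound KV cache usage
    "--tensor-parallel-size": "1",      # raise for multi-GPU instance types
}

cmd = ["python", "-m", "vllm.entrypoints.openai.api_server",
       "--model", "microsoft/Phi-3-mini-4k-instruct"]
for flag, value in engine_args.items():
    cmd += [flag, value]

print(shlex.join(cmd))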

TODO

Appendix

Support matrix

                  huggingface   vLLM   TensorRT-LLM
supported models  272           78     54

notes:
- huggingface: https://huggingface.co/docs/transformers/index#supported-models-and-frameworks
- vLLM: https://docs.vllm.ai/en/latest/models/supported_models.html
- TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html

Performance benchmark

At the end of this blog.

Out of Memory problem

  1. Start a vLLM inference server with zero CPU swap space.
python ./inference_api_vllm.py --swap-space 0

Make a request that asks for a large number of output sequences.

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "What is kubernetes?"
        }
    ],
    n=10000,  # request many parallel completions to exhaust the KV cache and CPU swap space
)

print(completion.choices[0].message)

The server exits with an error:

INFO 10-14 11:24:04 logger.py:36] Received request chat-90e9bde7074e402bb284fd0ab0c7d7e8: prompt: '<s><|user|>\nWhat is kubernetes?<|end|>\n<|assistant|>\n', params: SamplingParams(n=100, best_of=100, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4087, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 32010, 1724, 338, 413, 17547, 29973, 32007, 32001], lora_request: None, prompt_adapter_request: None.
INFO 10-14 11:24:04 engine.py:288] Added request chat-90e9bde7074e402bb284fd0ab0c7d7e8.
WARNING 10-14 11:24:09 scheduler.py:1439] Sequence group chat-90e9bde7074e402bb284fd0ab0c7d7e8 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
CRITICAL 10-14 11:24:09 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     ::1:58040 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-14 11:24:09 engine.py:157] RuntimeError('Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.')
  2. Restart the server with a larger swap space.
python ./inference_api_vllm.py --swap-space 8

The request was processed successfully:

INFO 10-14 11:28:42 logger.py:36] Received request chat-f9f440781d3a45e5be01e8f3fd16f661: prompt: '<s><|user|>\nWhat is kubernetes?<|end|>\n<|assistant|>\n', params: SamplingParams(n=100, best_of=100, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4087, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 32010, 1724, 338, 413, 17547, 29973, 32007, 32001], lora_request: None, prompt_adapter_request: None.
INFO 10-14 11:28:42 engine.py:288] Added request chat-f9f440781d3a45e5be01e8f3fd16f661.
WARNING 10-14 11:28:47 scheduler.py:1439] Sequence group chat-f9f440781d3a45e5be01e8f3fd16f661 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
WARNING 10-14 11:28:48 scheduler.py:691] Failing the request chat-f9f440781d3a45e5be01e8f3fd16f661 because there's not enough kv cache blocks to run the entire sequence.
INFO:     ::1:50168 - "POST /v1/chat/completions HTTP/1.1" 200 OK