Today, KAITO uses the Hugging Face Transformers runtime to build its inference and tuning services. It offers an out-of-the-box experience for nearly all transformer-based models hosted on Hugging Face.

However, other LLM inference libraries, such as vLLM and TensorRT-LLM, focus more on inference performance and resource efficiency. Many of our users want to use these libraries as their inference engine.
The new runtime should expose the same service surface as the current transformers runtime (probed in the sketch below):

- inference API
- health check API
- metrics
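For reference, a minimal probe of these three surfaces against vLLM's OpenAI-compatible server could look like the sketch below. The port is an assumption (matching the experiment later in this issue), and the paths shown are vLLM's; the transformers runtime's current routes may differ, which is exactly the compatibility gap to close.

```python
import requests

BASE = "http://localhost:5000"  # assumed port for the vLLM OpenAI-compatible server

# health check API: vLLM answers GET /health with 200 once the engine is ready
print("health:", requests.get(f"{BASE}/health").status_code)

# metrics: vLLM exposes Prometheus metrics on GET /metrics
print("metrics:", requests.get(f"{BASE}/metrics").text.splitlines()[:3])

# inference API: OpenAI-compatible chat completions endpoint
model = requests.get(f"{BASE}/v1/models").json()["data"][0]["id"]
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
)
print("inference:", resp.status_code)
```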
Change the default runtime from huggingface/transformers to vLLM. To keep compatibility with out-of-tree models, provide an annotation that allows the user to fall back to the huggingface/transformers runtime:
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini
  annotations:
    workspace.kaito.io/runtime: "huggingface"
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3
inference:
  preset:
    name: phi-3-mini-4k-instruct
```
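For illustration only, a rough sketch of how the controller could pick the runtime from this annotation, defaulting to vLLM. This is not KAITO's actual controller code (which is written in Go); the helper name and default value are assumptions.

```python
# Hypothetical sketch of annotation-based runtime selection.
RUNTIME_ANNOTATION = "workspace.kaito.io/runtime"  # annotation from the example above
SUPPORTED_RUNTIMES = {"vllm", "huggingface"}

def select_runtime(annotations: dict) -> str:
    """Return the runtime for a Workspace, falling back to vLLM when unset."""
    runtime = (annotations or {}).get(RUNTIME_ANNOTATION, "vllm")
    if runtime not in SUPPORTED_RUNTIMES:
        raise ValueError(f"unsupported runtime: {runtime}")
    return runtime

print(select_runtime({"workspace.kaito.io/runtime": "huggingface"}))  # -> huggingface
print(select_runtime({}))  # -> vllm (the proposed new default)
```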
Choose better default engine arguments for users.
notes: https://docs.vllm.ai/en/latest/models/engine_args.html
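As a starting point, a sketch of defaults KAITO could pass to the vLLM OpenAI-compatible server is below. The flags are real vLLM engine arguments, but the concrete values and the model name are assumptions that would need tuning per preset and GPU SKU.

```python
# Illustrative default engine arguments for the vLLM server.
# The values here are assumptions, not KAITO's decided defaults.
default_engine_args = {
    "--gpu-memory-utilization": "0.95",  # leave a little headroom for activations
    "--swap-space": "4",                 # GiB of CPU swap per GPU for preempted sequences
    "--max-model-len": "4096",           # bound KV cache usage per request
    "--tensor-parallel-size": "1",       # set to the number of GPUs on the node
}

cmd = ["python", "-m", "vllm.entrypoints.openai.api_server",
       "--model", "microsoft/Phi-3-mini-4k-instruct", "--port", "5000"]
for flag, value in default_engine_args.items():
    cmd += [flag, value]
print(" ".join(cmd))
```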
|  | huggingface | vLLM | TensorRT-LLM |
| --- | --- | --- | --- |
| supported models | 272 | 78 | 54 |
| notes | https://huggingface.co/docs/transformers/index#supported-models-and-frameworks | https://docs.vllm.ai/en/latest/models/supported_models.html | https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html |
The following experiment, placed at the end of this blog, shows how the vLLM server behaves when CPU swap space is exhausted. Start the server with swap space disabled:
```bash
python ./inference_api_vllm.py --swap-space 0
```
Make a request that asks for a large number of output sequences:
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5000/v1"
client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "What is kubernetes?"
        }
    ],
    n=10000,
)
print(completion.choices[0].message)
```
The server exits with an error:
```
INFO 10-14 11:24:04 logger.py:36] Received request chat-90e9bde7074e402bb284fd0ab0c7d7e8: prompt: '<s><|user|>\nWhat is kubernetes?<|end|>\n<|assistant|>\n', params: SamplingParams(n=100, best_of=100, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4087, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 32010, 1724, 338, 413, 17547, 29973, 32007, 32001], lora_request: None, prompt_adapter_request: None.
INFO 10-14 11:24:04 engine.py:288] Added request chat-90e9bde7074e402bb284fd0ab0c7d7e8.
WARNING 10-14 11:24:09 scheduler.py:1439] Sequence group chat-90e9bde7074e402bb284fd0ab0c7d7e8 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
CRITICAL 10-14 11:24:09 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO: ::1:58040 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-14 11:24:09 engine.py:157] RuntimeError('Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.')
```
```bash
python ./inference_api_vllm.py --swap-space 8
```
With the swap space set to 8 GiB, the request was processed successfully:
```
INFO 10-14 11:28:42 logger.py:36] Received request chat-f9f440781d3a45e5be01e8f3fd16f661: prompt: '<s><|user|>\nWhat is kubernetes?<|end|>\n<|assistant|>\n', params: SamplingParams(n=100, best_of=100, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4087, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 32010, 1724, 338, 413, 17547, 29973, 32007, 32001], lora_request: None, prompt_adapter_request: None.
INFO 10-14 11:28:42 engine.py:288] Added request chat-f9f440781d3a45e5be01e8f3fd16f661.
WARNING 10-14 11:28:47 scheduler.py:1439] Sequence group chat-f9f440781d3a45e5be01e8f3fd16f661 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
WARNING 10-14 11:28:48 scheduler.py:691] Failing the request chat-f9f440781d3a45e5be01e8f3fd16f661 because there's not enough kv cache blocks to run the entire sequence.
INFO: ::1:50168 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
**Is your feature request related to a problem? Please describe.**

**Describe the solution you'd like**

Today, KAITO supports the popular Hugging Face runtime. We should support other runtimes, such as vLLM.

**Describe alternatives you've considered**

**Additional context**