Example of doing inference via OpenAI completions, OpenAI chat completions, and Triton/KServe gRPC, all from the same app running Triton in-process:
OpenAI Completions
```
$ curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "prompt": "Machine learning is"
}' | jq
{
  "id": "cmpl-d004b6b0-7cf1-11ef-90ff-04d4c4933ecf",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " a subfield of artificial intelligence (AI) that involves training algorithms to automatically improve"
    }
  ],
  "created": 1727456349,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": null
}
```
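Since the endpoint is OpenAI-compatible, the same request should also work through the official `openai` Python client pointed at the local server. A minimal sketch; the dummy `api_key` is an assumption (the local frontend presumably doesn't validate keys):

```python
from openai import OpenAI

# Point the official client at the local Triton OpenAI frontend.
# api_key is a placeholder; assumption: the local server ignores it.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

completion = client.completions.create(
    model="llama-3.1-8b-instruct",
    prompt="Machine learning is",
)
print(completion.choices[0].text)
```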
OpenAI Chat Completions
```
$ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "messages": [{"role": "user", "content": "What is machine learning?"}]
}' | jq
{
  "id": "cmpl-dca120a2-7cf1-11ef-90ff-04d4c4933ecf",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Machine learning is a subset of artificial intelligence (AI) that involves the use of",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": null
    }
  ],
  "created": 1727456370,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}
```
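Likewise for chat completions via the `openai` client (same placeholder `api_key` assumption as above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(chat.choices[0].message.content)
```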
Triton/KServe streaming gRPC (via the Triton CLI for simplicity, but a client library can be used instead; see the sketch after the output below):
```
$ triton infer -m llama-3.1-8b-instruct --prompt "Machine learning is" -u localhost -p 8001
triton - INFO - Input:
{
  "name": "text_input",
  "shape": "(1,)",
  "dtype": "BYTES",
  "value": "['Machine learning is']"
}
triton - WARNING - Skipping optional input 'stream'
triton - WARNING - Skipping optional input 'sampling_parameters'
triton - WARNING - Skipping optional input 'exclude_input_in_output'
triton - INFO - Sending inference request...
triton - INFO - Output:
{
  "name": "text_output",
  "shape": "(1,)",
  "dtype": "BYTES",
  "value": "['Machine learning is a subfield of artificial intelligence that engages the use of statistical methods mixed with non']"
}
```
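A minimal sketch of the same request using the `tritonclient` gRPC Python package instead of the CLI. The tensor names `text_input`/`text_output` are taken from the CLI output above; the request is non-streaming for brevity, so check your model config for the exact names and optional inputs:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the KServe gRPC frontend (same port as the CLI example).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# BYTES inputs are passed as numpy object arrays of strings.
text = np.array(["Machine learning is"], dtype=np.object_)
inp = grpcclient.InferInput("text_input", [1], "BYTES")
inp.set_data_from_numpy(text)

result = client.infer(model_name="llama-3.1-8b-instruct", inputs=[inp])
print(result.as_numpy("text_output"))
```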
Other changes:

- `args.port` instead of `args.openai_port`
- `--enable-kserve-frontends` to allow opt-in, since it's currently an "openai_frontend" application, and in case we need to allow users to disable the kserve portions for some reason (see the sketch after this list)
- Cleanup to `--help` output example:
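The `--help` output itself isn't reproduced above. As a purely hypothetical sketch of the opt-in wiring described in the second bullet (function names and defaults are illustrative, not the PR's actual code):

```python
import argparse

def start_openai_frontend(port: int) -> None:
    print(f"[stub] starting OpenAI frontend on port {port}")

def start_kserve_frontends() -> None:
    print("[stub] starting KServe HTTP/gRPC frontends")

parser = argparse.ArgumentParser(description="openai_frontend")
# Renamed from --openai-port/args.openai_port per the first bullet above.
parser.add_argument("--port", type=int, default=9000,
                    help="Port for the OpenAI HTTP frontend")
# Opt-in flag: KServe frontends only start when explicitly requested.
parser.add_argument("--enable-kserve-frontends", action="store_true",
                    help="Also start the KServe HTTP/gRPC frontends")
args = parser.parse_args()

start_openai_frontend(args.port)
if args.enable_kserve_frontends:
    start_kserve_frontends()
```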