triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

chore: Fix argparse typo, cleanup argparse groups, make kserve frontends optional #7663

Closed rmccorm4 closed 2 months ago

rmccorm4 commented 2 months ago

Example of the cleaned-up --help output:

# python3 openai_frontend/main.py --help
usage: main.py [-h] --model-repository MODEL_REPOSITORY [--tokenizer TOKENIZER] [--backend {vllm,tensorrtllm}]
               [--tritonserver-log-verbose-level TRITONSERVER_LOG_VERBOSE_LEVEL] [--host HOST]
               [--openai-port OPENAI_PORT] [--uvicorn-log-level {debug,info,warning,error,critical,trace}]
               [--enable-kserve-frontends] [--kserve-http-port KSERVE_HTTP_PORT]
               [--kserve-grpc-port KSERVE_GRPC_PORT]

Triton Inference Server with OpenAI-Compatible RESTful API server.

options:
  -h, --help            show this help message and exit

Triton Inference Server:
  --model-repository MODEL_REPOSITORY
                        Path to the Triton model repository holding the models to be served
  --tokenizer TOKENIZER
                        HuggingFace ID or local folder path of the Tokenizer to use for chat templates
  --backend {vllm,tensorrtllm}
                        Manual override of Triton backend request format (inputs/output names) to use for inference
  --tritonserver-log-verbose-level TRITONSERVER_LOG_VERBOSE_LEVEL
                        The tritonserver log verbosity level
  --host HOST           Address/host of frontends (default: '0.0.0.0')

Triton OpenAI-Compatible Frontend:
  --openai-port OPENAI_PORT
                        OpenAI HTTP port (default: 9000)
  --uvicorn-log-level {debug,info,warning,error,critical,trace}
                        log level for uvicorn

Triton KServe Frontend:
  --enable-kserve-frontends
                        Enable KServe Predict v2 HTTP/GRPC frontends (disabled by default)
  --kserve-http-port KSERVE_HTTP_PORT
                        KServe Predict v2 HTTP port (default: 8000)
  --kserve-grpc-port KSERVE_GRPC_PORT
                        KServe Predict v2 GRPC port (default: 8001)
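
For context on the "cleanup argparse groups" part of this change, here is a minimal sketch of how sectioned --help output like the above can be produced with argparse argument groups. The flag names and defaults mirror the help text; this is illustrative, not the actual openai_frontend/main.py:

import argparse

parser = argparse.ArgumentParser(
    description="Triton Inference Server with OpenAI-Compatible RESTful API server."
)

# Grouping related flags gives the sectioned --help output shown above.
triton_group = parser.add_argument_group("Triton Inference Server")
triton_group.add_argument("--model-repository", required=True,
                          help="Path to the Triton model repository holding the models to be served")
triton_group.add_argument("--host", default="0.0.0.0",
                          help="Address/host of frontends (default: '0.0.0.0')")

openai_group = parser.add_argument_group("Triton OpenAI-Compatible Frontend")
openai_group.add_argument("--openai-port", type=int, default=9000,
                          help="OpenAI HTTP port (default: 9000)")

kserve_group = parser.add_argument_group("Triton KServe Frontend")
kserve_group.add_argument("--enable-kserve-frontends", action="store_true",
                          help="Enable KServe Predict v2 HTTP/GRPC frontends (disabled by default)")
kserve_group.add_argument("--kserve-http-port", type=int, default=8000,
                          help="KServe Predict v2 HTTP port (default: 8000)")
kserve_group.add_argument("--kserve-grpc-port", type=int, default=8001,
                          help="KServe Predict v2 GRPC port (default: 8001)")

args = parser.parse_args()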
rmccorm4 commented 2 months ago

Example of running inference via OpenAI completions, OpenAI chat, and Triton KServe gRPC, all from the same app running Triton in-process:

OpenAI Completions

$ curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "prompt": "Machine learning is"
}' | jq
{
  "id": "cmpl-d004b6b0-7cf1-11ef-90ff-04d4c4933ecf",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " a subfield of artificial intelligence (AI) that involves training algorithms to automatically improve"
    }
  ],
  "created": 1727456349,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": null
}
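
Since the frontend is OpenAI-compatible, the same completions request can also be made with the official openai Python client pointed at the local endpoint (a minimal sketch; the api_key value is a placeholder, assuming the local frontend does not validate it):

from openai import OpenAI

# Point the client at the local OpenAI-compatible frontend.
# api_key is a placeholder; assumed unvalidated by the local server.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

completion = client.completions.create(
    model="llama-3.1-8b-instruct",
    prompt="Machine learning is",
)
print(completion.choices[0].text)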

OpenAI Chat Completions

$ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "messages": [{"role": "user", "content": "What is machine learning?"}]
}' | jq
{
  "id": "cmpl-dca120a2-7cf1-11ef-90ff-04d4c4933ecf",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Machine learning is a subset of artificial intelligence (AI) that involves the use of",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": null
    }
  ],
  "created": 1727456370,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}
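
The chat request has the same client-side equivalent (same sketch assumptions as above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(chat.choices[0].message.content)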

Triton/KServe streaming gRPC (via the Triton CLI for simplicity, though a client library can be used instead):

$ triton infer -m llama-3.1-8b-instruct --prompt "Machine learning is" -u localhost -p 8001
triton - INFO - Input:
{
    "name": "text_input",
    "shape": "(1,)",
    "dtype": "BYTES",
    "value": "['Machine learning is']"
}
triton - WARNING - Skipping optional input 'stream'
triton - WARNING - Skipping optional input 'sampling_parameters'
triton - WARNING - Skipping optional input 'exclude_input_in_output'
triton - INFO - Sending inference request...
triton - INFO - Output:
{
    "name": "text_output",
    "shape": "(1,)",
    "dtype": "BYTES",
    "value": "['Machine learning is a subfield of artificial intelligence that engages the use of statistical methods mixed with non']"
}
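
For reference, roughly the same request via the tritonclient gRPC library instead of the CLI (a non-streaming sketch; the model name, tensor names, and BYTES dtype are taken from the CLI log above):

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the KServe gRPC frontend enabled above (default port 8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# "text_input"/"text_output" and the BYTES dtype match the CLI log above.
text = np.array(["Machine learning is".encode("utf-8")], dtype=np.object_)
infer_input = grpcclient.InferInput("text_input", [1], "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(model_name="llama-3.1-8b-instruct", inputs=[infer_input])
print(result.as_numpy("text_output"))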