triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

chore: Fix argparse typo, cleanup argparse groups, make kserve frontends optional #7663

Closed rmccorm4 closed 2 months ago

rmccorm4 commented 2 months ago

Example of the cleaned-up --help output:

# python3 openai_frontend/main.py --help
usage: main.py [-h] --model-repository MODEL_REPOSITORY [--tokenizer TOKENIZER] [--backend {vllm,tensorrtllm}]
               [--tritonserver-log-verbose-level TRITONSERVER_LOG_VERBOSE_LEVEL] [--host HOST]
               [--openai-port OPENAI_PORT] [--uvicorn-log-level {debug,info,warning,error,critical,trace}]
               [--enable-kserve-frontends] [--kserve-http-port KSERVE_HTTP_PORT]
               [--kserve-grpc-port KSERVE_GRPC_PORT]

Triton Inference Server with OpenAI-Compatible RESTful API server.

options:
  -h, --help            show this help message and exit

Triton Inference Server:
  --model-repository MODEL_REPOSITORY
                        Path to the Triton model repository holding the models to be served
  --tokenizer TOKENIZER
                        HuggingFace ID or local folder path of the Tokenizer to use for chat templates
  --backend {vllm,tensorrtllm}
                        Manual override of Triton backend request format (inputs/output names) to use for inference
  --tritonserver-log-verbose-level TRITONSERVER_LOG_VERBOSE_LEVEL
                        The tritonserver log verbosity level
  --host HOST           Address/host of frontends (default: '0.0.0.0')

Triton OpenAI-Compatible Frontend:
  --openai-port OPENAI_PORT
                        OpenAI HTTP port (default: 9000)
  --uvicorn-log-level {debug,info,warning,error,critical,trace}
                        log level for uvicorn

Triton KServe Frontend:
  --enable-kserve-frontends
                        Enable KServe Predict v2 HTTP/GRPC frontends (disabled by default)
  --kserve-http-port KSERVE_HTTP_PORT
                        KServe Predict v2 HTTP port (default: 8000)
  --kserve-grpc-port KSERVE_GRPC_PORT
                        KServe Predict v2 GRPC port (default: 8001)
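
For context on the "cleanup argparse groups" part of this change, here is a minimal sketch of how sectioned --help output like the above can be produced with argparse argument groups. The flag names and defaults mirror the help text; this is illustrative, not the actual openai_frontend/main.py:

import argparse

parser = argparse.ArgumentParser(
    description="Triton Inference Server with OpenAI-Compatible RESTful API server."
)

# Grouping related flags gives the sectioned --help output shown above.
triton_group = parser.add_argument_group("Triton Inference Server")
triton_group.add_argument("--model-repository", required=True,
                          help="Path to the Triton model repository holding the models to be served")
triton_group.add_argument("--host", default="0.0.0.0",
                          help="Address/host of frontends (default: '0.0.0.0')")

openai_group = parser.add_argument_group("Triton OpenAI-Compatible Frontend")
openai_group.add_argument("--openai-port", type=int, default=9000,
                          help="OpenAI HTTP port (default: 9000)")

kserve_group = parser.add_argument_group("Triton KServe Frontend")
kserve_group.add_argument("--enable-kserve-frontends", action="store_true",
                          help="Enable KServe Predict v2 HTTP/GRPC frontends (disabled by default)")
kserve_group.add_argument("--kserve-http-port", type=int, default=8000,
                          help="KServe Predict v2 HTTP port (default: 8000)")
kserve_group.add_argument("--kserve-grpc-port", type=int, default=8001,
                          help="KServe Predict v2 GRPC port (default: 8001)")

args = parser.parse_args()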
rmccorm4 commented 2 months ago

Example of running inference via OpenAI completions, OpenAI chat, and Triton KServe gRPC, all from the same app running Triton in-process:

OpenAI Completions

$ curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "prompt": "Machine learning is"
}' | jq
{
  "id": "cmpl-d004b6b0-7cf1-11ef-90ff-04d4c4933ecf",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " a subfield of artificial intelligence (AI) that involves training algorithms to automatically improve"
    }
  ],
  "created": 1727456349,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": null
}
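
Since the frontend is OpenAI-compatible, the same completions request can also be made with the official openai Python client pointed at the local endpoint (a minimal sketch; the api_key value is a placeholder, assuming the local frontend does not validate it):

from openai import OpenAI

# Point the client at the local OpenAI-compatible frontend.
# api_key is a placeholder; assumed unvalidated by the local server.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

completion = client.completions.create(
    model="llama-3.1-8b-instruct",
    prompt="Machine learning is",
)
print(completion.choices[0].text)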

OpenAI Chat Completions

$ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "messages": [{"role": "user", "content": "What is machine learning?"}]
}' | jq
{
  "id": "cmpl-dca120a2-7cf1-11ef-90ff-04d4c4933ecf",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Machine learning is a subset of artificial intelligence (AI) that involves the use of",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": null
    }
  ],
  "created": 1727456370,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}
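
The chat request has the same client-side equivalent (same sketch assumptions as above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(chat.choices[0].message.content)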

Triton/KServe streaming gRPC (via the Triton CLI for simplicity, though a client library can be used instead):

$ triton infer -m llama-3.1-8b-instruct --prompt "Machine learning is" -u localhost -p 8001
triton - INFO - Input:
{
    "name": "text_input",
    "shape": "(1,)",
    "dtype": "BYTES",
    "value": "['Machine learning is']"
}
triton - WARNING - Skipping optional input 'stream'
triton - WARNING - Skipping optional input 'sampling_parameters'
triton - WARNING - Skipping optional input 'exclude_input_in_output'
triton - INFO - Sending inference request...
triton - INFO - Output:
{
    "name": "text_output",
    "shape": "(1,)",
    "dtype": "BYTES",
    "value": "['Machine learning is a subfield of artificial intelligence that engages the use of statistical methods mixed with non']"
}
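
For reference, roughly the same request via the tritonclient gRPC library instead of the CLI (a non-streaming sketch; the model name, tensor names, and BYTES dtype are taken from the CLI log above):

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the KServe gRPC frontend enabled above (default port 8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# "text_input"/"text_output" and the BYTES dtype match the CLI log above.
text = np.array(["Machine learning is".encode("utf-8")], dtype=np.object_)
infer_input = grpcclient.InferInput("text_input", [1], "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(model_name="llama-3.1-8b-instruct", inputs=[infer_input])
print(result.as_numpy("text_output"))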