triton-inference-server / triton_cli

Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.

Catch errors and improve logging in Profiler #23

Closed: nv-hwoo closed this 9 months ago

nv-hwoo commented 9 months ago

The change catches errors raised while running Perf Analyzer in the Profiler, improves how they are logged, and adds a sanity check on the number of generated output tokens (a minimal sketch of such a check follows below). The sanity check prevents potential miscalculations that can arise when a user requests the model to generate a sequence longer than the model's context length.
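
A minimal sketch of such a check in Python (the function name check_output_token_count and its call sites are illustrative assumptions, not the actual Profiler code):

def check_output_token_count(expected: int, received: int) -> None:
    # If the prompt already fills the model's context window (gpt2 has a
    # 1024-token context), the model can return far fewer tokens than the
    # user requested, and any throughput computed over the missing tokens
    # would be misleading.
    if received < expected:
        raise ValueError(
            f"Expecting {expected} tokens but received {received} tokens. "
            "This could be due to a long prompt size. "
            "Please double check the input and output length."
        )

check_output_token_count(expected=128, received=128)  # passes
check_output_token_count(expected=128, received=1)    # raises ValueError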

Non-verbose (no error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 800
 * Output tokens: 128

[ BENCHMARK SUMMARY ]
 * Avg first token latency: 27.3099 ms
 * Avg generation throughput: 186.8861 output tokens/s

triton - INFO - Stopping server...

Non-verbose (with Perf Analyzer error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - ERROR - Unexpected error: Encountered the following error while running Perf Analyzer:
Failed to retrieve results from inference request.
Thread [0] had error: Error generating stream: 'ascii' codec can't encode character '\xab' in position 77: ordinal not in range(128)
triton - INFO - Stopping server...
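
In non-verbose mode the Perf Analyzer output is kept out of the log and only the captured error is surfaced as a single ERROR line before the server is stopped. Roughly, the handling could look like the following Python sketch (PerfAnalyzerError, run_perf_analyzer, and the exact perf_analyzer arguments are assumptions for illustration, not the actual implementation):

import logging
import subprocess

logger = logging.getLogger("triton")

class PerfAnalyzerError(Exception):
    """Hypothetical error type for a failed Perf Analyzer run."""

def run_perf_analyzer(cmd, verbose=False):
    # Capture stdout/stderr instead of letting perf_analyzer write to the console.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if verbose:
        # Verbose mode echoes the full Perf Analyzer output (see the verbose example below).
        logger.info("Perf Analyzer output:\n %s", result.stdout)
    if result.returncode != 0:
        raise PerfAnalyzerError(
            "Encountered the following error while running Perf Analyzer:\n"
            + result.stderr.strip()
        )
    return result.stdout

try:
    run_perf_analyzer(["perf_analyzer", "-m", "gpt2", "--streaming"])
except Exception as e:
    # One concise ERROR line for the user, then proceed to stop the server.
    logger.error("Unexpected error: %s", e)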

Non-verbose (with sanity check error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 1024
 * Output tokens: 128

triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...

Verbose (with sanity check error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128 --verbose
triton - DEBUG - Using existing model repository: /root/models
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - DEBUG - HuggingFace prefix detected, parsing HuggingFace ID
triton - DEBUG - Adding new model to repo at: /root/models/gpt2/1
triton - INFO - Current repo at /root/models:
models/
└── gpt2/
    ├── 1/
    │   └── model.json
    └── config.pbtxt
triton - DEBUG - No --mode specified, trying the following modes: ['local', 'docker']
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - DEBUG - Failed to start server in 'docker' mode. Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Perf Analyzer output:
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using asynchronous calls for inference
  Detected decoupled model, using the first response for measuring latency
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 71
    Throughput: 23.6401 infer/sec
    Response Throughput: 23.6401 infer/sec
    Avg latency: 41843 usec (standard deviation 3879 usec)
    p50 latency: 42018 usec
    p90 latency: 47008 usec
    p95 latency: 48011 usec
    p99 latency: 48873 usec

  Server:
    Inference count: 72
    Execution count: 72
    Successful request count: 72
    Avg request latency: 510 usec (overhead 4 usec + queue 71 usec + compute input 42 usec + compute infer 387 usec + compute output 6 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 23.6401 infer/sec, latency 41843 usec

triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 1024
 * Output tokens: 128

triton - INFO - Perf Analyzer output:
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using asynchronous calls for inference
  Detected decoupled model, using the first response for measuring latency
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 96
    Throughput: 31.9703 infer/sec
    Response Throughput: 31.9703 infer/sec
    Avg latency: 31033 usec (standard deviation 11848 usec)
    p50 latency: 35807 usec
    p90 latency: 44262 usec
    p95 latency: 45771 usec
    p99 latency: 48866 usec

  Server:
    Inference count: 97
    Execution count: 97
    Successful request count: 97
    Avg request latency: 473 usec (overhead 3 usec + queue 60 usec + compute input 40 usec + compute infer 364 usec + compute output 5 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 31.9703 infer/sec, latency 31033 usec

triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...
triton - DEBUG - Stopped Triton Server.