The change includes:
- Sanity checking profile results to make sure the expected number of responses are generated
- Using a logger instead of print statements
- Adding a verbose mode (see the logging sketch after this list)
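A minimal sketch of how the logger and the `--verbose` flag could fit together. The function and flag wiring below are illustrative assumptions rather than the actual triton_cli code, but they reproduce the `triton - LEVEL - message` format seen in the outputs further down:

```python
import argparse
import logging

logger = logging.getLogger("triton")

def setup_logging(verbose: bool) -> None:
    # DEBUG records are only emitted when --verbose is passed;
    # INFO and above are always shown.
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(name)s - %(levelname)s - %(message)s",
    )

parser = argparse.ArgumentParser()
parser.add_argument("--verbose", action="store_true")
args = parser.parse_args()
setup_logging(args.verbose)

logger.info("Server is ready for inference.")                   # always printed
logger.debug("Using existing model repository: /root/models")   # verbose only
```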
The sanity check prevents a potential miscalculation that can arise when the user asks the model to generate a sequence longer than the model's context length. In that case the model returns fewer output tokens than requested, so the benchmark summary would otherwise be computed from the wrong token count.
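As a rough illustration (the function name and arguments are hypothetical, not the actual triton_cli implementation), the check boils down to comparing the number of output tokens received against the number requested and raising the error shown in the logs below:

```python
def check_profile_results(expected_output_tokens: int, received_output_tokens: int) -> None:
    # Hypothetical sketch: verify that a profiled request actually produced the
    # requested number of output tokens before computing the summary metrics.
    if received_output_tokens != expected_output_tokens:
        raise RuntimeError(
            f"Expecting {expected_output_tokens} tokens but received "
            f"{received_output_tokens} tokens. This could be due to a long prompt "
            "size. Please double check the input and output length."
        )
```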
Non-verbose (no error):
root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
* Model: gpt2
* Backend: vllm
* Batch size: 1
* Input tokens: 800
* Output tokens: 128
[ BENCHMARK SUMMARY ]
* Avg first token latency: 27.3099 ms
* Avg generation throughput: 186.8861 output tokens/s
triton - INFO - Stopping server...
Non-verbose (with PA error):
root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - ERROR - Unexpected error: Encountered the following error while running Perf Analyzer:
Failed to retrieve results from inference request.
Thread [0] had error: Error generating stream: 'ascii' codec can't encode character '\xab' in position 77: ordinal not in range(128)
triton - INFO - Stopping server...
Non-verbose (with sanity check error):
root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
* Model: gpt2
* Backend: vllm
* Batch size: 1
* Input tokens: 1024
* Output tokens: 128
triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...
Verbose (no error):
root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128 --verbose
triton - DEBUG - Using existing model repository: /root/models
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - DEBUG - HuggingFace prefix detected, parsing HuggingFace ID
triton - DEBUG - Adding new model to repo at: /root/models/gpt2/1
triton - INFO - Current repo at /root/models:
models/
└── gpt2/
    ├── 1/
    │   └── model.json
    └── config.pbtxt
triton - DEBUG - No --mode specified, trying the following modes: ['local', 'docker']
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - DEBUG - Failed to start server in 'docker' mode. Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Perf Analyzer output:
Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using asynchronous calls for inference
Detected decoupled model, using the first response for measuring latency
Stabilizing using average latency
Request concurrency: 1
Client:
  Request count: 71
  Throughput: 23.6401 infer/sec
  Response Throughput: 23.6401 infer/sec
  Avg latency: 41843 usec (standard deviation 3879 usec)
  p50 latency: 42018 usec
  p90 latency: 47008 usec
  p95 latency: 48011 usec
  p99 latency: 48873 usec
Server:
  Inference count: 72
  Execution count: 72
  Successful request count: 72
  Avg request latency: 510 usec (overhead 4 usec + queue 71 usec + compute input 42 usec + compute infer 387 usec + compute output 6 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 23.6401 infer/sec, latency 41843 usec
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
* Model: gpt2
* Backend: vllm
* Batch size: 1
* Input tokens: 1024
* Output tokens: 128
triton - INFO - Perf Analyzer output:
Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using asynchronous calls for inference
Detected decoupled model, using the first response for measuring latency
Stabilizing using average latency
Request concurrency: 1
Client:
  Request count: 96
  Throughput: 31.9703 infer/sec
  Response Throughput: 31.9703 infer/sec
  Avg latency: 31033 usec (standard deviation 11848 usec)
  p50 latency: 35807 usec
  p90 latency: 44262 usec
  p95 latency: 45771 usec
  p99 latency: 48866 usec
Server:
  Inference count: 97
  Execution count: 97
  Successful request count: 97
  Avg request latency: 473 usec (overhead 3 usec + queue 60 usec + compute input 40 usec + compute infer 364 usec + compute output 5 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 31.9703 infer/sec, latency 31033 usec
triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...
triton - DEBUG - Stopped Triton Server.