Measuring inference speed metrics for hosted and local LLMs
GenAI-Perf is a command-line tool for measuring the throughput and latency of generative AI models as served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter-token latency, and request throughput. For a full list of metrics, see the Metrics section.
Users specify a model name, an inference server URL, the type of inputs to use (synthetic or from a dataset), and the type of load to generate (number of concurrent requests, request rate).
GenAI-Perf generates the specified load, measures the performance of the inference server, and reports the metrics in a simple table as console output. The tool also logs all results in a CSV file that can be used to derive additional metrics and visualizations. The inference server must already be running when GenAI-Perf is run.
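Since the CSV log is intended for deriving additional metrics, post-processing it might look like the sketch below. The column names (`request_latency_ms`, `output_tokens`) are illustrative assumptions, not GenAI-Perf's actual CSV schema; check the header of the file the tool writes.

```python
# Sketch: deriving extra metrics from GenAI-Perf's CSV log.
# NOTE: column names below are hypothetical; inspect the real CSV header.
import csv
import io
import statistics

# Stand-in for reading the file GenAI-Perf wrote.
mock_csv = """request_latency_ms,output_tokens
900,120
1100,140
1000,130
"""

rows = list(csv.DictReader(io.StringIO(mock_csv)))
latencies = [float(r["request_latency_ms"]) for r in rows]

# Per-request output token rate, a derived metric not in the summary table.
tokens_per_second = [
    float(r["output_tokens"]) / (float(r["request_latency_ms"]) / 1000.0)
    for r in rows
]

print(statistics.mean(latencies))         # 1000.0
print(round(max(tokens_per_second), 2))   # 133.33
```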
[!Note] GenAI-Perf is currently in early release and under rapid development. While we will try to remain consistent, command line options and functionality are subject to change as the tool matures.
Installation
Triton SDK Container
Available starting with the 24.03 release of the Triton Server SDK container.
Run the Triton Inference Server SDK Docker container:
export RELEASE="mm.yy" # e.g. export RELEASE="24.03"
docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
Run GenAI-Perf:
genai-perf --help
From Source
This method requires that Perf Analyzer is installed in your development environment and that you have at least Python 3.10 installed. To build Perf Analyzer from source, see here.
export RELEASE="mm.yy" # e.g. export RELEASE="24.03"
pip install "git+https://github.com/triton-inference-server/client.git@r${RELEASE}#subdirectory=src/c++/perf_analyzer/genai-perf"
Run GenAI-Perf:
genai-perf --help
Quick Start
Measuring Throughput and Latency of GPT2 using Triton + TensorRT-LLM
Running GPT2 on Triton Inference Server using TensorRT-LLM
See instructions
Running GenAI-Perf
See Tutorial for additional examples.
Model Inputs
GenAI-Perf supports model input prompts either from synthetically generated inputs or from the HuggingFace OpenOrca or CNN_DailyMail datasets. This is specified using the --input-dataset CLI option.
When the dataset is synthetic, you can specify the following options:
- `--num-prompts <int>`: The number of unique prompts to generate as stimulus (>= 1).
- `--synthetic-input-tokens-mean <int>`: The mean number of tokens in the generated prompts when the prompt source is synthetic (>= 1).
- `--synthetic-input-tokens-stddev <int>`: The standard deviation of the number of tokens in the generated prompts when the prompt source is synthetic (>= 0).
- `--random-seed <int>`: The seed used to generate random values (>= 0).
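As a rough illustration (not GenAI-Perf's actual implementation), the synthetic-input options above can be thought of as driving a seeded sampler like this, with each prompt's length drawn from a normal distribution and clamped to at least one token:

```python
import random

def sample_prompt_lengths(num_prompts, tokens_mean, tokens_stddev, seed):
    """Illustrative sampler for --num-prompts / --synthetic-input-tokens-*."""
    rng = random.Random(seed)  # --random-seed makes runs reproducible
    return [
        max(1, round(rng.gauss(tokens_mean, tokens_stddev)))
        for _ in range(num_prompts)
    ]

lengths = sample_prompt_lengths(num_prompts=5, tokens_mean=550,
                                tokens_stddev=250, seed=0)
print(lengths)  # five token counts, each >= 1, identical across runs
```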
When the dataset is coming from HuggingFace, you can specify the following options:
- `--input-dataset {openorca,cnn_dailymail}`: The HuggingFace dataset to use for benchmarking.
- `--num-prompts <int>`: The number of unique prompts to use as stimulus (>= 1).
For any dataset, you can specify the following options:
- `--output-tokens-mean <int>`: The mean number of tokens in each output (>= 1). Ensure the `--tokenizer` value is set correctly.
- `--output-tokens-stddev <int>`: The standard deviation of the number of tokens in each output (>= 1). This is only used when `--output-tokens-mean` is provided.
- `--output-tokens-mean-deterministic`: When using `--output-tokens-mean`, this flag can be set to improve precision by setting the minimum number of tokens equal to the requested number of tokens. This is currently supported only with the Triton service-kind. Note that there is still some variability in the number of output tokens, but GenAI-Perf makes a best effort with your model to produce the right number of output tokens.
You can optionally set additional model inputs with the following option:
- `--extra-inputs <input_name>:<value>`: An additional input for use with the model with a singular value, such as `stream:true` or `max_tokens:5`. This flag can be repeated to supply multiple extra inputs.
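Putting the options together, a small helper (hypothetical, not part of GenAI-Perf) shows how such an invocation could be assembled, including a repeated `--extra-inputs` flag. The `-m` model flag and the specific values are assumptions for illustration:

```python
def build_genai_perf_cmd(model, extra_inputs=(), **opts):
    """Assemble a genai-perf argument list from keyword options (illustrative)."""
    cmd = ["genai-perf", "-m", model]
    for name, value in opts.items():
        # Keyword args map to CLI flags: num_prompts -> --num-prompts
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    for item in extra_inputs:  # --extra-inputs can be repeated
        cmd += ["--extra-inputs", item]
    return cmd

cmd = build_genai_perf_cmd(
    "gpt2",
    extra_inputs=["stream:true", "max_tokens:5"],
    num_prompts=100,
    synthetic_input_tokens_mean=550,
    output_tokens_mean=128,
)
print(" ".join(cmd))
```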
Metrics
GenAI-Perf collects a diverse set of metrics that capture the performance of the inference server.
| Metric | Description | Aggregations |
| --- | --- | --- |
| Time to First Token | Time between when a request is sent and when its first response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Inter Token Latency | Time between intermediate responses for a single request, divided by the number of generated tokens of the latter response; one value per response per request in benchmark | Avg, min, max, p99, p90, p75 |
| Request Latency | Time between when a request is sent and when its final response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Number of Output Tokens | Total number of output tokens of a request; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Output Token Throughput | Total number of output tokens from benchmark divided by benchmark duration | None (one value per benchmark) |
| Request Throughput | Number of final responses from benchmark divided by benchmark duration | None (one value per benchmark) |
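The per-request definitions in the table can be sketched directly from timestamps. The code below is illustrative only; names like `send_time` and `tokens_per_response` are assumptions, not GenAI-Perf internals:

```python
def ttft(send_time, response_times):
    """Time to First Token: first response time minus request send time."""
    return response_times[0] - send_time

def inter_token_latencies(response_times, tokens_per_response):
    """Gap between consecutive responses, divided by the token count of the
    later response (one value per intermediate response)."""
    return [
        (response_times[i] - response_times[i - 1]) / tokens_per_response[i]
        for i in range(1, len(response_times))
    ]

def request_latency(send_time, response_times):
    """Time from request send to final response."""
    return response_times[-1] - send_time

# One mock streaming request: sent at t=0.0 s, three responses.
send = 0.0
resp = [0.5, 1.0, 1.5]   # response arrival times in seconds
toks = [1, 2, 2]         # tokens generated in each response

print(ttft(send, resp))                   # 0.5
print(inter_token_latencies(resp, toks))  # [0.25, 0.25]
print(request_latency(send, resp))        # 1.5
```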