
Measuring inference speed metrics for hosted and local LLM #822

ShellLM opened this issue 5 months ago (status: Open)

ShellLM commented 5 months ago

Measuring inference speed metrics for hosted and local LLM

Snippet

GenAI-Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter token latency, and request throughput. For a full list of metrics, see the Metrics section below.

Users specify a model name, an inference server URL, the type of inputs to use (synthetic or from a dataset), and the type of load to generate (number of concurrent requests, request rate).

GenAI-Perf generates the specified load, measures the performance of the inference server, and reports the metrics in a simple table as console output. The tool also logs all results in a CSV file that can be used to derive additional metrics and visualizations. The inference server must already be running when GenAI-Perf is run.
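
As a minimal sketch of that kind of post-processing, the CSV export can be loaded with pandas; the file name and column layout below are assumptions (they depend on the GenAI-Perf version and options used), so adjust them to whatever your run actually produced.

import pandas as pd

# Hypothetical artifact name; check the working directory for the CSV
# that genai-perf wrote alongside the profile export.
df = pd.read_csv("profile_export_genai_perf.csv")

# Inspect the recorded statistics before deriving new metrics or plots.
print(df.columns.tolist())
print(df.head())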

Note: GenAI-Perf is currently in early release and under rapid development. While we will try to remain consistent, command line options and functionality are subject to change as the tool matures.

Installation

Triton SDK Container

Available starting with the 24.03 release of the Triton Server SDK container.

Run the Triton Inference Server SDK docker container:

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Run GenAI-Perf:

genai-perf --help

From Source

This method requires that Perf Analyzer is installed in your development environment and that you have at least Python 3.10 installed. To build Perf Analyzer from source, see the Perf Analyzer documentation.

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

pip install "git+https://github.com/triton-inference-server/client.git@r${RELEASE}#subdirectory=src/c++/perf_analyzer/genai-perf"

Run GenAI-Perf:

genai-perf --help

Quick Start

Measuring Throughput and Latency of GPT2 using Triton + TensorRT-LLM

Running GPT2 on Triton Inference Server using TensorRT-LLM

See the Triton + TensorRT-LLM setup instructions; the server must be running before GenAI-Perf is started.

Running GenAI-Perf

Run Triton Inference Server SDK container:

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Run GenAI-Perf:

genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --prompt-source synthetic \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --streaming \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:8001

Example output:

                                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃                Statistic ┃         avg ┃         min ┃         max ┃         p99 ┃         p90 ┃         p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Time to first token (ns) │  13,266,974 │  11,818,732 │  18,351,779 │  16,513,479 │  13,741,986 │  13,544,376 │
│ Inter token latency (ns) │   2,069,766 │      42,023 │  15,307,799 │   3,256,375 │   3,020,580 │   2,090,930 │
│     Request latency (ns) │ 223,532,625 │ 219,123,330 │ 241,004,192 │ 238,198,306 │ 229,676,183 │ 224,715,918 │
│         Num output token │         104 │         100 │         129 │         128 │         109 │         105 │
│          Num input token │         199 │         199 │         199 │         199 │         199 │         199 │
└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
Output token throughput (per sec): 460.42
Request throughput (per sec): 4.44
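
The latency statistics above are reported in nanoseconds; converting the averages from this example run gives more familiar units (a rough sketch, using only the numbers in the table):

# Average values from the example table above, in nanoseconds.
latencies_ns = {
    "Time to first token": 13_266_974,
    "Inter token latency": 2_069_766,
    "Request latency": 223_532_625,
}

for name, ns in latencies_ns.items():
    print(f"{name}: {ns / 1e6:.2f} ms")
# Time to first token: 13.27 ms
# Inter token latency: 2.07 ms
# Request latency: 223.53 ms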

See the Tutorial for additional examples.

Model Inputs

GenAI-Perf supports model input prompts that are either synthetically generated or drawn from the HuggingFace OpenOrca or CNN_DailyMail datasets. The prompt source is selected on the command line (see --prompt-source in the Quick Start above), and the specific HuggingFace dataset is chosen with the --input-dataset option.

When the inputs are synthetic, options such as --synthetic-input-tokens-mean and --synthetic-input-tokens-stddev control the length of the generated prompts. When the inputs come from HuggingFace, --input-dataset selects the dataset to sample from. For any input source, options such as --num-prompts and --random-seed control how many prompts are used and make runs reproducible, and additional model inputs can optionally be passed through on the command line; run genai-perf --help for the complete list of input options.

Metrics

GenAI-Perf collects a diverse set of metrics that capture the performance of the inference server.

| Metric | Description | Aggregations |
| --- | --- | --- |
| Time to First Token | Time between when a request is sent and when its first response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Inter Token Latency | Time between intermediate responses for a single request, divided by the number of generated tokens of the latter response; one value per response per request in benchmark | Avg, min, max, p99, p90, p75 |
| Request Latency | Time between when a request is sent and when its final response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Number of Output Tokens | Total number of output tokens of a request; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Output Token Throughput | Total number of output tokens from benchmark divided by benchmark duration | None; one value per benchmark |
| Request Throughput | Number of final responses from benchmark divided by benchmark duration | None; one value per benchmark |
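
As a rough consistency check on how these metrics relate, the example run above lines up: output token throughput is approximately the average number of output tokens per request multiplied by the request throughput, and with a concurrency of 1 the request throughput is roughly the inverse of the average request latency.

# Values taken from the example output earlier in this issue.
avg_output_tokens = 104
request_throughput = 4.44                      # requests per second
avg_request_latency_s = 223_532_625 / 1e9      # convert ns to seconds

print(avg_output_tokens * request_throughput)  # ~461.8 tokens/s vs the reported 460.42
print(1 / avg_request_latency_s)               # ~4.47 req/s vs the reported 4.44 (concurrency 1)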

Suggested labels

None

ShellLM commented 5 months ago

Related content

- #408 (similarity score: 0.89)
- #690 (similarity score: 0.88)
- #649 (similarity score: 0.87)
- #324 (similarity score: 0.86)
- #498 (similarity score: 0.86)
- #811 (similarity score: 0.86)