
Measuring inference speed metrics for hosted and local LLM #822

ShellLM opened this issue 5 months ago (status: Open)

ShellLM commented 5 months ago

Measuring inference speed metrics for hosted and local LLM

Snippet

GenAI-Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter token latency, and request throughput. For a full list of metrics, see the Metrics section below.

Users specify a model name, an inference server URL, the type of inputs to use (synthetic or from a dataset), and the type of load to generate (number of concurrent requests, request rate).

GenAI-Perf generates the specified load, measures the performance of the inference server, and reports the metrics in a simple table as console output. The tool also logs all results in a CSV file that can be used to derive additional metrics and visualizations. The inference server must already be running when GenAI-Perf is run.
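
As a minimal sketch of that kind of post-processing, the CSV export can be loaded with pandas; the file name and column layout below are assumptions (they depend on the GenAI-Perf version and options used), so adjust them to whatever your run actually produced.

import pandas as pd

# Hypothetical artifact name; check the working directory for the CSV
# that genai-perf wrote alongside the profile export.
df = pd.read_csv("profile_export_genai_perf.csv")

# Inspect the recorded statistics before deriving new metrics or plots.
print(df.columns.tolist())
print(df.head())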

Note: GenAI-Perf is currently in early release and under rapid development. While we will try to remain consistent, command line options and functionality are subject to change as the tool matures.

Installation

Triton SDK Container

Available starting with the 24.03 release of the Triton Server SDK container.

Run the Triton Inference Server SDK docker container:

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Run GenAI-Perf:

genai-perf --help

From Source

This method requires that Perf Analyzer is installed in your development environment and that you have at least Python 3.10 installed. To build Perf Analyzer from source, see the Perf Analyzer documentation.

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

pip install "git+https://github.com/triton-inference-server/client.git@r${RELEASE}#subdirectory=src/c++/perf_analyzer/genai-perf"

Run GenAI-Perf:

genai-perf --help

Quick Start

Measuring Throughput and Latency of GPT2 using Triton + TensorRT-LLM

Running GPT2 on Triton Inference Server using TensorRT-LLM

See the Triton + TensorRT-LLM setup instructions; the server must be running before GenAI-Perf is started.

Running GenAI-Perf

Run Triton Inference Server SDK container:

export RELEASE="mm.yy" # e.g. export RELEASE="24.03"

docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Run GenAI-Perf:

genai-perf \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --prompt-source synthetic \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --streaming \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:8001

Example output:

                                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃                Statistic ┃         avg ┃         min ┃         max ┃         p99 ┃         p90 ┃         p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Time to first token (ns) │  13,266,974 │  11,818,732 │  18,351,779 │  16,513,479 │  13,741,986 │  13,544,376 │
│ Inter token latency (ns) │   2,069,766 │      42,023 │  15,307,799 │   3,256,375 │   3,020,580 │   2,090,930 │
│     Request latency (ns) │ 223,532,625 │ 219,123,330 │ 241,004,192 │ 238,198,306 │ 229,676,183 │ 224,715,918 │
│         Num output token │         104 │         100 │         129 │         128 │         109 │         105 │
│          Num input token │         199 │         199 │         199 │         199 │         199 │         199 │
└──────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
Output token throughput (per sec): 460.42
Request throughput (per sec): 4.44
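
The latency statistics above are reported in nanoseconds; converting the averages from this example run gives more familiar units (a rough sketch, using only the numbers in the table):

# Average values from the example table above, in nanoseconds.
latencies_ns = {
    "Time to first token": 13_266_974,
    "Inter token latency": 2_069_766,
    "Request latency": 223_532_625,
}

for name, ns in latencies_ns.items():
    print(f"{name}: {ns / 1e6:.2f} ms")
# Time to first token: 13.27 ms
# Inter token latency: 2.07 ms
# Request latency: 223.53 ms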

See the Tutorial for additional examples.

Model Inputs

GenAI-Perf supports model input prompts that are either synthetically generated or drawn from the HuggingFace OpenOrca or CNN_DailyMail datasets. The prompt source is selected on the command line (see --prompt-source in the Quick Start above), and the specific HuggingFace dataset is chosen with the --input-dataset option.

When the inputs are synthetic, options such as --synthetic-input-tokens-mean and --synthetic-input-tokens-stddev control the length of the generated prompts. When the inputs come from HuggingFace, --input-dataset selects the dataset to sample from. For any input source, options such as --num-prompts and --random-seed control how many prompts are used and make runs reproducible, and additional model inputs can optionally be passed through on the command line; run genai-perf --help for the complete list of input options.

Metrics

GenAI-Perf collects a diverse set of metrics that capture the performance of the inference server.

| Metric | Description | Aggregations |
| --- | --- | --- |
| Time to First Token | Time between when a request is sent and when its first response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Inter Token Latency | Time between intermediate responses for a single request, divided by the number of generated tokens of the latter response; one value per response per request in benchmark | Avg, min, max, p99, p90, p75 |
| Request Latency | Time between when a request is sent and when its final response is received; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Number of Output Tokens | Total number of output tokens of a request; one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Output Token Throughput | Total number of output tokens from benchmark divided by benchmark duration | None; one value per benchmark |
| Request Throughput | Number of final responses from benchmark divided by benchmark duration | None; one value per benchmark |
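
As a rough consistency check on how these metrics relate, the example run above lines up: output token throughput is approximately the average number of output tokens per request multiplied by the request throughput, and with a concurrency of 1 the request throughput is roughly the inverse of the average request latency.

# Values taken from the example output earlier in this issue.
avg_output_tokens = 104
request_throughput = 4.44                      # requests per second
avg_request_latency_s = 223_532_625 / 1e9      # convert ns to seconds

print(avg_output_tokens * request_throughput)  # ~461.8 tokens/s vs the reported 460.42
print(1 / avg_request_latency_s)               # ~4.47 req/s vs the reported 4.44 (concurrency 1)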

Suggested labels

None

ShellLM commented 5 months ago

Related content

- #408 (similarity score: 0.89)
- #690 (similarity score: 0.88)
- #649 (similarity score: 0.87)
- #324 (similarity score: 0.86)
- #498 (similarity score: 0.86)
- #811 (similarity score: 0.86)