triton-inference-server / triton_cli

Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.

Add profile subcommand to run perf analyzer #13

Closed matthewkotila closed 10 months ago

matthewkotila commented 10 months ago

Example output:

$ triton model profile -m llama
pull_engine()
run_server()
profile()
Warming up...
Warmed up, profiling now...
[ BENCHMARK SUMMARY ]
Prompt size: --
  * Max first token latency: -- ms
  * Min first token latency: -- ms
  * Avg first token latency: -- ms
  * p50 first token latency: -- ms
  * p90 first token latency: -- ms
  * p95 first token latency: -- ms
  * p99 first token latency: -- ms
  * Max generation latency: -- ms
  * Min generation latency: -- ms
  * Avg generation latency: -- ms
  * p50 generation latency: -- ms
  * p90 generation latency: -- ms
  * p95 generation latency: -- ms
  * p99 generation latency: -- ms
  * Avg output token latency: -- ms/output token
  * Avg total token-to-token latency: -- ms
  * Max end-to-end latency: -- ms
  * Min end-to-end latency: -- ms
  * Avg end-to-end latency: -- ms
  * p50 end-to-end latency: -- ms
  * p90 end-to-end latency: -- ms
  * p95 end-to-end latency: -- ms
  * p99 end-to-end latency: -- ms
  * Max end-to-end throughput: -- tokens/s
  * Min end-to-end throughput: -- tokens/s
  * Avg end-to-end throughput: -- tokens/s
  * p50 end-to-end throughput: -- tokens/s
  * p90 end-to-end throughput: -- tokens/s
  * p95 end-to-end throughput: -- tokens/s
  * p99 end-to-end throughput: -- tokens/s
  * Max generation throughput: -- output tokens/s
  * Min generation throughput: -- output tokens/s
  * Avg generation throughput: -- output tokens/s
  * p50 generation throughput: -- output tokens/s
  * p90 generation throughput: -- output tokens/s
  * p95 generation throughput: -- output tokens/s
  * p99 generation throughput: -- output tokens/s
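
To make the proposal concrete, here is a minimal sketch (not the actual triton_cli implementation) of how a `profile` subcommand could shell out to `perf_analyzer` and print the warm-up messages shown above. Everything beyond `perf_analyzer -m <model>` (the function names, the pass-through arguments, and assuming a server is already running) is an illustrative assumption:

```python
# Hypothetical sketch of a `profile` subcommand wrapping perf_analyzer.
# Not the triton_cli code; flag and function names are assumptions.
import argparse
import shutil
import subprocess
import sys


def profile(model: str, extra_args: list[str]) -> int:
    """Run perf_analyzer against an already-running Triton server."""
    if shutil.which("perf_analyzer") is None:
        print("perf_analyzer not found on PATH", file=sys.stderr)
        return 1

    cmd = ["perf_analyzer", "-m", model, *extra_args]
    print("Warming up...")                 # mirrors the example output above
    print("Warmed up, profiling now...")
    # Let perf_analyzer stream its own output; a real implementation would
    # capture and reduce it into the [ BENCHMARK SUMMARY ] block shown above.
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="triton model profile")
    parser.add_argument("-m", "--model", required=True)
    args, passthrough = parser.parse_known_args()
    sys.exit(profile(args.model, passthrough))
```

In the actual CLI flow suggested by the log lines above, the engine pull and server launch (`pull_engine()`, `run_server()`) would happen before the profiling step rather than being assumed as here.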

cc @nv-hwoo