triton-inference-server / triton_cli

Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.

Fix high concurrency generation throughput calculation #16

Closed: nv-hwoo closed this issue 10 months ago

nv-hwoo commented 10 months ago

The expected output:

$ triton model profile -m llama7b
Warming up...
Warmed up, profiling now...

[ PROFILE CONFIGURATIONS ]
 * Model: llama7b
 * Batch size: 32
 * Input tokens: 2048
 * Output tokens: 128

[ BENCHMARK SUMMARY ]
 * Avg first token latency: -- ms
 * Avg generation throughput: -- output tokens/s

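For context on what a "high concurrency generation throughput calculation" fix typically involves (this is an illustrative sketch, not the actual patch in this PR): when many requests run concurrently, summing or averaging per-request throughputs double-counts overlapping time, so generation throughput is usually computed as total generated tokens divided by the wall-clock span of the generation phase. The `RequestRecord` type and `generation_throughput` function below are hypothetical names, and the assumption that the first token is excluded from the generation phase is my own.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    start_s: float        # request sent
    first_token_s: float  # first output token received
    end_s: float          # last output token received
    output_tokens: int    # total output tokens produced

def generation_throughput(records: list[RequestRecord]) -> float:
    """Aggregate generation throughput across concurrent requests.

    Counts tokens produced after each request's first token, divided by
    the wall-clock span of the generation phase over ALL requests, so
    overlapping requests are not double-counted.
    """
    total_tokens = sum(r.output_tokens - 1 for r in records)
    span_s = max(r.end_s for r in records) - min(r.first_token_s for r in records)
    return total_tokens / span_s

# Two fully overlapping requests, each generating 21 tokens:
# 40 post-first-token tokens over a shared 2.0 s window -> 20 tokens/s.
records = [
    RequestRecord(0.0, 0.5, 2.5, 21),
    RequestRecord(0.0, 0.5, 2.5, 21),
]
print(generation_throughput(records))
```

The key point is dividing by the shared window rather than summing each request's own `tokens / duration`, which at high concurrency would report roughly N times the true server throughput.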
cc @matthewkotila