triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Fluctuating results when perf_analyzer is run for a stress test #7436

Open LinGeLin opened 2 months ago

LinGeLin commented 2 months ago

Description: I used the latest image, version 24.06, because the corresponding TensorRT release adds support for BF16. I deployed the model with the TensorRT backend and used perf_analyzer to stress test the model service, and the results fluctuate.

Triton Information: 2.47.0

Are you using the Triton container or did you build it yourself?

image version 24.06

To Reproduce: run perf_analyzer with the following command:

perf_analyzer --concurrency-range 1:8  -p 5000  --latency-threshold 300 -f perf.csv -m my_model_name -i grpc --request-distribution poisson -b 256 -u localhost:6601  --percentile 99 --input-data=random

My stress test results:

*** Measurement Settings ***
  Batch size: 256
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 300 msec
  Concurrency limit: 8 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p99 latency

Request concurrency: 1
  Client:
    Request count: 1299
    Throughput: 18473.3 infer/sec
    p50 latency: 13806 usec
    p90 latency: 13945 usec
    p95 latency: 14248 usec
    p99 latency: 14610 usec
    Avg gRPC time: 13836 usec ((un)marshal request/response 1300 usec + response wait 12536 usec)
  Server:
    Inference count: 332544
    Execution count: 1299
    Successful request count: 1299
    Avg request latency: 11282 usec (overhead 34 usec + queue 28 usec + compute input 2846 usec + compute infer 8318 usec + compute output 55 usec)

Request concurrency: 2
  Client:
    Request count: 1611
    Throughput: 22910.6 infer/sec
    p50 latency: 22316 usec
    p90 latency: 22440 usec
    p95 latency: 22488 usec
    p99 latency: 22598 usec
    Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
  Server:
    Inference count: 412416
    Execution count: 1611
    Successful request count: 1611
    Avg request latency: 19400 usec (overhead 37 usec + queue 8099 usec + compute input 2840 usec + compute infer 8327 usec + compute output 96 usec)

Request concurrency: 3
  Client:
    Request count: 1091
    Throughput: 15515.2 infer/sec
    p50 latency: 49428 usec
    p90 latency: 49735 usec
    p95 latency: 50021 usec
    p99 latency: 54494 usec
    Avg gRPC time: 49517 usec ((un)marshal request/response 1346 usec + response wait 48171 usec)
  Server:
    Inference count: 279296
    Execution count: 727
    Successful request count: 1091
    Avg request latency: 46345 usec (overhead 119 usec + queue 20338 usec + compute input 3312 usec + compute infer 22479 usec + compute output 96 usec)

Request concurrency: 4
  Client:
    Request count: 2135
    Throughput: 30362.8 infer/sec
    p50 latency: 33672 usec
    p90 latency: 33822 usec
    p95 latency: 33867 usec
    p99 latency: 33992 usec
    Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
  Server:
    Inference count: 546560
    Execution count: 1068
    Successful request count: 2135
    Avg request latency: 30395 usec (overhead 153 usec + queue 13290 usec + compute input 3549 usec + compute infer 13301 usec + compute output 101 usec)

Request concurrency: 5
  Client:
    Request count: 2136
    Throughput: 30377 infer/sec
    p50 latency: 36885 usec
    p90 latency: 50683 usec
    p95 latency: 50778 usec
    p99 latency: 51032 usec
    Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
  Server:
    Inference count: 546816
    Execution count: 1068
    Successful request count: 2136
    Avg request latency: 38631 usec (overhead 154 usec + queue 21520 usec + compute input 3572 usec + compute infer 13285 usec + compute output 99 usec)

Request concurrency: 6
  Client:
    Request count: 2136
    Throughput: 30377 infer/sec
    p50 latency: 50544 usec
    p90 latency: 50729 usec
    p95 latency: 50806 usec
    p99 latency: 50961 usec
    Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
  Server:
    Inference count: 546816
    Execution count: 1068
    Successful request count: 2136
    Avg request latency: 47023 usec (overhead 171 usec + queue 29900 usec + compute input 3580 usec + compute infer 13271 usec + compute output 100 usec)

Request concurrency: 7
  Client:
    Request count: 1497
    Throughput: 21289.7 infer/sec
    p50 latency: 84223 usec
    p90 latency: 84519 usec
    p95 latency: 84635 usec
    p99 latency: 87573 usec
    Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
  Server:
    Inference count: 383232
    Execution count: 855
    Successful request count: 1497
    Avg request latency: 80567 usec (overhead 157 usec + queue 59422 usec + compute input 3555 usec + compute infer 17334 usec + compute output 98 usec)

Request concurrency: 8
  Client:
    Request count: 2130
    Throughput: 30291.3 infer/sec
    p50 latency: 67560 usec
    p90 latency: 67819 usec
    p95 latency: 67898 usec
    p99 latency: 68080 usec
    Avg gRPC time: 67562 usec ((un)marshal request/response 1426 usec + response wait 66136 usec)
  Server:
    Inference count: 545280
    Execution count: 1065
    Successful request count: 2130
    Avg request latency: 64059 usec (overhead 178 usec + queue 46884 usec + compute input 3670 usec + compute infer 13226 usec + compute output 101 usec)

Concurrency: 1, throughput: 18473.3 infer/sec, latency 14610 usec
Concurrency: 2, throughput: 22910.6 infer/sec, latency 22598 usec
Concurrency: 3, throughput: 15515.2 infer/sec, latency 54494 usec
Concurrency: 4, throughput: 30362.8 infer/sec, latency 33992 usec
Concurrency: 5, throughput: 30377 infer/sec, latency 51032 usec
Concurrency: 6, throughput: 30377 infer/sec, latency 50961 usec
Concurrency: 7, throughput: 21289.7 infer/sec, latency 87573 usec
Concurrency: 8, throughput: 30291.3 infer/sec, latency 68080 usec

You can see that the throughput drops significantly when the concurrency is 3 or 7. This seems very strange. Does anyone know a possible cause?
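One detail in the server-side counters above (an observation from the reported numbers, not an explanation confirmed in this thread): at concurrency 1 and 2 every execution handles exactly one 256-sized request, while from concurrency 3 onward the Execution count falls below the Request count, i.e. the dynamic batcher sometimes merges two requests into a 512-sized batch. A minimal sketch that computes the average executed batch size from the counters copied out of the report:

# Average executed batch size per concurrency level, computed from the
# "Inference count" and "Execution count" values reported above.
counters = {
    1: (332544, 1299),
    2: (412416, 1611),
    3: (279296, 727),
    4: (546560, 1068),
    5: (546816, 1068),
    6: (546816, 1068),
    7: (383232, 855),
    8: (545280, 1065),
}
for concurrency, (inference_count, execution_count) in counters.items():
    print(f"concurrency {concurrency}: avg batch {inference_count / execution_count:.1f}")

This prints 256 for concurrency 1 and 2, roughly 384 and 448 for concurrency 3 and 7, and 512 for the rest, so the throughput dips coincide with the runs in which executions mix 256- and 512-sized batches.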

Some settings in config.pbtxt:

max_batch_size: 512

instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 256
  max_queue_delay_microseconds: 100
}
optimization {
  cuda {
    busy_wait_events: true
    output_copy_stream: true
  }
}
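Purely to illustrate how these fields interact (this is not a change suggested anywhere in the thread): with max_batch_size: 512 and a client batch of 256, the dynamic batcher is allowed to merge two requests into a single 512-sized execution. A hypothetical variant that caps each execution at one 256-sized request would look like:

# Hypothetical sketch: with max_batch_size equal to the client batch size,
# the dynamic batcher cannot merge two 256-sized requests into one execution.
max_batch_size: 256

dynamic_batching {
  preferred_batch_size: 256
  max_queue_delay_microseconds: 100
}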

Expected behavior: Is there a statistical problem with how the time is measured, or is there a configuration problem? I hope to see a more stable result.

rmccorm4 commented 1 month ago

CC @matthewkotila @nv-hwoo if you have any thoughts on the variance or improvements to the provided PA arguments

matthewkotila commented 1 month ago

I don't have any concrete ideas on why this would be happening.

@LinGeLin have you tried re-running the entire experiment multiple times to confirm that it consistently shows degraded performance for concurrencies 3 and 7? Perhaps you'll want to decrease the stability percentage (-s)? And/or increase the measurement window (--measurement-interval)?
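For reference, applying those two suggestions to the original command might look something like the following; the concrete values are arbitrary illustrations, not recommendations from this thread:

perf_analyzer --concurrency-range 1:8 --measurement-interval 20000 --stability-percentage 5 --latency-threshold 300 -f perf.csv -m my_model_name -i grpc --request-distribution poisson -b 256 -u localhost:6601 --percentile 99 --input-data=random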