triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

[question] Total avg queue time = 0 usec for ensemble model #2429

Closed. Slyne closed this issue 3 years ago.

Slyne commented 3 years ago

Description: I used an ensemble model, which consists of four composing models spanning three backends: 2 TensorRT + 1 PyTorch + 1 custom backend. However, the perf_analyzer output below looks quite strange.
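
For reference, the ensemble wires together the four composing models that show up in the report below. The actual config.pbtxt for quartznet-ensemble is not included in this issue, so the step order, tensor names, data types, and dims in the sketch that follows are illustrative guesses only; the composing-model names are taken from the report.

    # Hypothetical config.pbtxt for quartznet-ensemble (sketch only).
    # Step order, tensor names, dtypes, and dims are assumptions; the
    # composing-model names match the perf_analyzer report below.
    name: "quartznet-ensemble"
    platform: "ensemble"
    max_batch_size: 1
    input [ { name: "AUDIO", data_type: TYPE_FP32, dims: [ -1 ] } ]
    output [ { name: "TRANSCRIPT", data_type: TYPE_STRING, dims: [ -1 ] } ]
    ensemble_scheduling {
      step [
        {
          model_name: "jasper-feature-extractor"
          model_version: -1
          input_map { key: "AUDIO" value: "AUDIO" }
          output_map { key: "FEATURES" value: "features" }
        },
        {
          model_name: "quartznet-trt"
          model_version: -1
          input_map { key: "FEATURES" value: "features" }
          output_map { key: "LOGITS" value: "logits" }
        },
        {
          model_name: "ctc-greedy-helper"
          model_version: -1
          input_map { key: "LOGITS" value: "logits" }
          output_map { key: "TOKENS" value: "tokens" }
        },
        {
          model_name: "greedy-decoder"
          model_version: -1
          input_map { key: "TOKENS" value: "tokens" }
          output_map { key: "TRANSCRIPT" value: "TRANSCRIPT" }
        }
      ]
    }

In a layout like this, each request flows through the four steps in order, and each composing model keeps its own request queue.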

perf_analyzer -m quartznet-ensemble -b 1 --concurrency-range 500:600:100 --input-data=BAC009S0723W0151.json 
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 600 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 500
  Client: 
    Request count: 1206
    Throughput: 241.2 infer/sec
    Avg latency: 1889372 usec (standard deviation 5671 usec)
    p50 latency: 1888639 usec
    p90 latency: 1896658 usec
    p95 latency: 1897224 usec
    p99 latency: 1898173 usec
    Avg HTTP time: 1887031 usec (send/recv 237 usec + response wait 1886794 usec)
  Server: 
    Inference count: 1597
    Execution count: 1597
    Successful request count: 1597
    Avg request latency: 1885711 usec
    Total avg compute input time : 4184 usec
    Total avg compute infer time : 3291 usec
    Total avg compute output time : 281 usec
    Total avg queue time : 0 usec

  Composing models: 
  ctc-greedy-helper, version: 
      Inference count: 1597
      Execution count: 1597
      Successful request count: 1597
      Avg request latency: 538 usec (overhead 1 usec + queue 41 usec + compute input 1 usec + compute infer 495 usec + compute output 0 usec)

  greedy-decoder, version: 
      Inference count: 1597
      Execution count: 1597
      Successful request count: 1597
      Avg request latency: 868 usec (overhead 1 usec + queue 48 usec + compute input 310 usec + compute infer 327 usec + compute output 182 usec)

  jasper-feature-extractor, version: 
      Inference count: 1597
      Execution count: 1597
      Successful request count: 1597
      Avg request latency: 2412 usec (overhead 1 usec + queue 46 usec + compute input 127 usec + compute infer 2152 usec + compute output 86 usec)

  quartznet-trt, version: 
      Inference count: 1597
      Execution count: 1597
      Successful request count: 1597
      Avg request latency: 1881826 usec (overhead 1 usec + queue 1877709 usec + compute input 3744 usec + compute infer 365 usec + compute output 7 usec)

Request concurrency: 600
  Client: 
    Request count: 1193
    Throughput: 238.6 infer/sec
    Avg latency: 2210227 usec (standard deviation 3362 usec)
    p50 latency: 2209739 usec
    p90 latency: 2215668 usec
    p95 latency: 2216114 usec
    p99 latency: 2216736 usec
    Avg HTTP time: 2210211 usec (send/recv 246 usec + response wait 2209965 usec)
  Server: 
    Inference count: 1630
    Execution count: 1630
    Successful request count: 1630
    Avg request latency: 2208960 usec
    Total avg compute input time : 4151 usec
    Total avg compute infer time : 3366 usec
    Total avg compute output time : 240 usec
    Total avg queue time : 0 usec

  Composing models: 
  ctc-greedy-helper, version: 
      Inference count: 1630
      Execution count: 1630
      Successful request count: 1630
      Avg request latency: 505 usec (overhead 1 usec + queue 41 usec + compute input 1 usec + compute infer 462 usec + compute output 0 usec)

  greedy-decoder, version: 
      Inference count: 1631
      Execution count: 1631
      Successful request count: 1631
      Avg request latency: 905 usec (overhead 1 usec + queue 47 usec + compute input 356 usec + compute infer 338 usec + compute output 163 usec)

  jasper-feature-extractor, version: 
      Inference count: 1630
      Execution count: 1630
      Successful request count: 1630
      Avg request latency: 2413 usec (overhead 3 usec + queue 41 usec + compute input 126 usec + compute infer 2173 usec + compute output 70 usec)

  quartznet-trt, version: 
      Inference count: 1630
      Execution count: 1630
      Successful request count: 1630
      Avg request latency: 2205036 usec (overhead 2 usec + queue 2200973 usec + compute input 3668 usec + compute infer 387 usec + compute output 6 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 500, throughput: 241.2 infer/sec, latency 1889372 usec
Concurrency: 600, throughput: 238.6 infer/sec, latency 2210227 usec

I just want to ask whether this output is normal. What does it mean to have zero queue time? (The average compute and infer times also don't seem right.)

    Total avg compute input time : 4184 usec
    Total avg compute infer time : 3291 usec
    Total avg compute output time : 281 usec
    Total avg queue time : 0 usec

Triton Information: What version of Triton are you using? 20.10

Are you using the Triton container or did you build it yourself? nvcr.io/nvidia/tritonserver:20.10-py3

Expected behavior: I expect the avg queue time to be a reasonable figure.

deadeyegoodwin commented 3 years ago

The avg queue time does seem wrong. Can you be more specific about what you think is wrong with the other times?

tanmayv25 commented 3 years ago

@Slyne I must fix the terminology in the report. These are not the actual total avg compute input, compute infer, compute output, and queue times, but the components as seen by the ensemble scheduler. Because there is no queue in the ensemble itself and an incoming request proceeds directly to the first step (the first composing model), the queue time is reported as zero.
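
You can see this in the concurrency-500 run above: the ensemble row reports "Total avg queue time : 0 usec" even though quartznet-trt alone reports "queue 1877709 usec"; that queueing happens inside the composing model's own scheduler, which the ensemble-level row does not reflect. The compute rows, by contrast, roughly add up across the composing models (compute input: 1 + 310 + 127 + 3744 = 4182 usec, versus the reported 4184 usec).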

@deadeyegoodwin I should probably remove the "Total" term to prevent the confusion. These numbers will then just be for the requested model (quartznet-ensemble), which in this case is an ensemble, followed by the composing models.
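
As a side note, one way to cross-check per-model queue times outside of perf_analyzer is Triton's Prometheus metrics endpoint (port 8002 by default). A minimal check, assuming the server is running locally with metrics enabled, could look like:

    # Cumulative queue time per model (composing models included), in usec.
    curl -s localhost:8002/metrics | grep nv_inference_queue_duration_us

The duration metrics are cumulative counters, so an average per request has to be computed against the corresponding request count metric.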