triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Compute output much larger than compute input #4691

Closed. Vozf closed this issue 1 year ago.

Vozf commented 2 years ago

I have a model where perf_analyzer shows the following results

      Avg request latency: 753955 usec (overhead 32 usec + queue 78 usec + compute input 12555 usec + compute infer 215699 usec + compute output 525591 usec)

The model is an image2image model with a 1568x1568px input and a 1568x1568px output, running on GPU. As I understand it, compute input and compute output in this case should mainly consist of RAM-to-GPU-memory and GPU-memory-to-RAM transfers, respectively. The model runs as part of an ensemble with a python CPU preprocessing -> torchscript GPU model -> python CPU postprocessing pipeline. Here is the composing models perf_analyzer output, if it's relevant:

  Composing models:
  model, version:
      Inference count: 45
      Execution count: 45
      Successful request count: 45
      Avg request latency: 753955 usec (overhead 32 usec + queue 78 usec + compute input 12555 usec + compute infer 215699 usec + compute output 525591 usec)

  postprocess, version:
      Inference count: 45
      Execution count: 45
      Successful request count: 45
      Avg request latency: 275589 usec (overhead 21 usec + queue 66 usec + compute input 9434 usec + compute infer 265165 usec + compute output 902 usec)

  preprocess, version:
      Inference count: 45
      Execution count: 45
      Successful request count: 45
      Avg request latency: 103051 usec (overhead 17 usec + queue 48 usec + compute input 260 usec + compute infer 101177 usec + compute output 1548 usec)

What are the possible reasons that compute input is so small while compute output is so big, even in comparison to compute infer? Are there ways to decrease compute output?
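For scale, a rough back-of-envelope check (a sketch assuming single-channel float32 tensors and ~10 GB/s effective PCIe bandwidth, both assumptions rather than measured values) suggests the device-to-host copy alone should only take on the order of a millisecond:

```python
# Rough estimate of the raw transfer cost of one 1568x1568 output tensor.
# Assumptions: single channel, float32, ~10 GB/s effective PCIe bandwidth.
bytes_per_elem = 4                          # float32
h = w = 1568
tensor_mb = h * w * bytes_per_elem / 1e6    # ~9.8 MB per tensor

pcie_mb_per_s = 10_000                      # ~10 GB/s effective bandwidth
copy_ms = tensor_mb / pcie_mb_per_s * 1e3
print(f"{tensor_mb:.1f} MB per tensor, ~{copy_ms:.2f} ms to copy")
```

so half a second of compute output looks far too large to be explained by the memory transfer itself.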

GuanLuo commented 2 years ago

What Triton version are you using? And are you using the Torch backend for the model that has the large compute output time? The timestamps captured in the Torch backend may not be exact due to the asynchronous nature of the execution (see detail), so part of the compute infer time may be shifted into the compute output time. If you have an estimate of the execution time, could you compare it with the reported compute infer time and see whether that could be the cause?
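As a minimal illustration of that effect (the model path and input shape below are placeholders, not the ones from this issue): timing the forward call alone only measures kernel launch, while the subsequent device-to-host copy has to wait for the kernels to finish, so the GPU work shows up in the second interval:

```python
import time
import torch

# Placeholder model/input purely for illustration.
model = torch.jit.load("model.pt").eval().cuda()
x = torch.rand(1, 3, 1568, 1568, device="cuda")

with torch.no_grad():
    t0 = time.perf_counter()
    y = model(x)        # returns as soon as the CUDA kernels are queued
    t1 = time.perf_counter()
    y_cpu = y.cpu()     # the D2H copy blocks until the kernels have finished
    t2 = time.perf_counter()

print(f"'compute infer'-like interval:  {(t1 - t0) * 1e3:.1f} ms")  # looks small
print(f"'compute output'-like interval: {(t2 - t1) * 1e3:.1f} ms")  # absorbs the GPU work
```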

Vozf commented 2 years ago

22.05, and yes, the torch backend. The large compute output is stable and is always 2x the compute infer. How can I get an estimate of the compute time? Do you mean the compute time I get running the torchscript module in plain python without triton?

GuanLuo commented 2 years ago

Do you mean the compute time I get running the torchscript module in plain python without triton?

That is correct
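A minimal sketch of that measurement, with explicit CUDA synchronization so the GPU work is fully counted (the model path, batch size, and channel count are assumptions, not values from this issue):

```python
import time
import torch

model = torch.jit.load("model.pt").eval().cuda()   # placeholder path
x = torch.rand(1, 3, 1568, 1568, device="cuda")    # assumed input layout

with torch.no_grad():
    for _ in range(5):                  # warm-up
        model(x)
    torch.cuda.synchronize()

    n = 20
    t0 = time.perf_counter()
    for _ in range(n):
        model(x)
    torch.cuda.synchronize()            # wait for all queued GPU work
    t1 = time.perf_counter()

print(f"avg compute per request: {(t1 - t0) / n * 1e3:.1f} ms")
```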

Vozf commented 2 years ago

I got 0.73 s/request in plain torchscript, basically a little less than in triton. And that doesn't include the GPU-to-CPU memory copy.

GuanLuo commented 2 years ago

It seems like part of the compute output time is actually the infer time. @Tabrizian, for the incoming change to use a dedicated CUDA stream per model instance, would it be possible to capture the compute end time more accurately?
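For reference, one way to capture the GPU-side end time independently of host timestamps is CUDA events; a sketch of the idea (not the backend's actual instrumentation, and the model/input below are placeholders):

```python
import torch

model = torch.jit.load("model.pt").eval().cuda()   # placeholder
x = torch.rand(1, 3, 1568, 1568, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    start.record()      # recorded in stream order, before the model's kernels
    y = model(x)
    end.record()        # recorded after the last queued kernel
    end.synchronize()   # wait for this event only

print(f"GPU execution time: {start.elapsed_time(end):.1f} ms")
```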

jbkyang-nvi commented 1 year ago

Closing this issue due to lack of activity. Please re-open it if you would like to follow up.