What Triton version are you using? And are you using the Torch backend for the model that has the large compute output time? The timestamp captured in the Torch backend may not be exact due to the asynchronous nature of the execution (see detail), so part of the compute infer time may be shifted into the compute output time. If you have an estimate of the execution time, could you compare it with the reported compute infer time and see if that could be the cause?
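To illustrate the effect, here is a minimal sketch (the model path and input shape are made up) of how an asynchronous CUDA launch can shift time from one measurement into the next: the host timestamp taken right after the forward call returns before the GPU kernels finish, so the remaining work is absorbed into the next synchronizing call, such as the device-to-host copy of the output.

```python
import time
import torch

# Hypothetical TorchScript model and input; adjust to your own.
model = torch.jit.load("model.pt").cuda().eval()
x = torch.randn(1, 3, 1568, 1568, device="cuda")

with torch.no_grad():
    t0 = time.perf_counter()
    y = model(x)              # returns as soon as the kernels are queued
    t1 = time.perf_counter()  # "compute infer" style timestamp: taken too early
    y_cpu = y.cpu()           # the D2H copy blocks, so it waits for the kernels
    t2 = time.perf_counter()  # "compute output" absorbs the leftover GPU time

print(f"apparent infer:  {t1 - t0:.4f}s")  # underestimates the true GPU time
print(f"apparent output: {t2 - t1:.4f}s")  # inflated by the pending kernels
```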
22.05, and yes, the Torch backend. The large compute output time is stable and is always ~2x the compute infer time. How can I get an estimate of the compute time? Do you mean the compute time I get by running the TorchScript module in plain Python, without Triton?
> Do you mean the compute time I get by running the TorchScript module in plain Python, without Triton?
That is correct
I got 0.73 s/request in plain TorchScript, basically a little less than in Triton, and that doesn't include the GPU-to-CPU memory copy.
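For reference, a sketch of how a number like that can be measured with explicit synchronization, so the compute time and the GPU-to-CPU copy are timed separately (the model path and input shape are assumptions based on the issue):

```python
import time
import torch

model = torch.jit.load("model.pt").cuda().eval()  # hypothetical path
x = torch.randn(1, 3, 1568, 1568, device="cuda")

with torch.no_grad():
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    y = model(x)
    torch.cuda.synchronize()  # wait for the kernels: true compute time
    t1 = time.perf_counter()
    y_cpu = y.cpu()           # GPU-to-CPU copy timed on its own
    t2 = time.perf_counter()

print(f"compute:  {t1 - t0:.4f}s")
print(f"D2H copy: {t2 - t1:.4f}s")
```

If the D2H copy of the output tensor alone is far smaller than the reported compute output time, that supports the timestamp-shift explanation above.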
It seems like part of the compute output time is actually infer time. @Tabrizian, regarding the incoming change to use a dedicated CUDA stream per model instance, would it be possible to capture the compute end time more accurately?
Closing this issue due to lack of activity. Please re-open it if you would like to follow up.
I have a model where perf_analyzer shows the following results:
The model is an image2image model with a 1568x1568px input and a 1568x1568px output, running on the GPU. As I understand it, compute input and compute output in that case should mainly consist of RAM-to-GPU and GPU-to-RAM memory transfers, respectively. The model runs as part of an ensemble with a Python CPU preprocessing -> TorchScript GPU model -> Python CPU postprocessing pipeline. perf_analyzer results for the composing models, in case it's relevant:
What are the possible reasons that compute input is so small while compute output is so large, even in comparison to compute infer? Are there ways to decrease the compute output time?