triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Newer versions of Triton server have a considerable slowdown in start time #6014

Open jnlarrain opened 1 year ago

jnlarrain commented 1 year ago

Description: At my work we are currently deploying our models with Triton server 22.05. We wanted to move to 23.05, but we realized that the startup of the new version is more than 3x slower with ONNX plus the TRT optimization flag, and 16x slower with offline TRT optimization. After some testing we identified 2 different architectures, out of 5 models deployed in the same container, whose start time is considerably longer.

The average startup time over 10 runs for ONNX models with the TRT optimization flag is:

| version | seconds |
|---------|---------|
| 22.05   | 205     |
| 23.01   | 422     |
| 23.05   | 750     |

The average startup time for TRT models is:

| version | seconds |
|---------|---------|
| 22.05   | 4       |
| 23.05   | 65      |

Triton Information: What version of Triton are you using?

For the purpose of this test: 22.05, 23.01 and 23.05. We want to focus on 22.05 and 23.05.

Are you using the Triton container or did you build it yourself?

We built the containers using the compose.py script in the repository:

python3 compose.py --backend tensorrt --backend pytorch --backend dali --backend=onnxruntime --repoagent checksum --enable-gpu

To Reproduce: We ran these tests on 3 benchmark machines, all of them set up with Ubuntu 20.04 server and NVIDIA driver 530.41: one with a GeForce 10XX series card, another with a 20XX RTX series card, and finally one with a 30XX RTX series card, all of them showing the same problem. For version 22.05 we start the Triton server with --strict-readiness true --strict-model-config true. For versions 23.01 and 23.05 we start the Triton server with --strict-readiness true --disable-auto-complete-config.
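
For reference, the server inside the container is launched roughly like this (the model repository path /models is a placeholder for wherever the models live in the image):

22.05:          tritonserver --model-repository=/models --strict-readiness true --strict-model-config true
23.01 / 23.05:  tritonserver --model-repository=/models --strict-readiness true --disable-auto-complete-config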

For the offline TRT optimization we use the NVIDIA PyTorch containers with the matching version, i.e. Triton 22.05 with PyTorch 22.05.

The config file for ONNX with TRT optimization looks like:

name: "example_model"
platform: "onnxruntime_onnx"
default_model_filename: "model.onnx"
max_batch_size : 4
dynamic_batching {}
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 256, 256 ]
  }
]
output [
  {
    name: "output_1"
    data_type: TYPE_FP32
    dims: [ 3, 256, 256 ]
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
      }
    ]
  }
}
model_warmup {
  name: "warmup_1"
  batch_size: 1
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [ 3, 256, 256 ]
      random_data: true
    }
  }
}
model_warmup {
  name: "warmup_2"
  batch_size: 2
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [ 3, 256, 256 ]
      random_data: true
    }
  }
}
model_warmup {
  name: "warmup_3"
  batch_size: 3
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [ 3, 256, 256 ]
      random_data: true
    }
  }
}
model_warmup {
  name: "warmup_4"
  batch_size: 4
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [ 3, 256, 256 ]
      random_data: true
    }
  }
}

Meanwhile, the offline TRT config looks like:

name: "example_model"
platform: "tensorrt_plan"
default_model_filename: "model.trt"
max_batch_size : 4
dynamic_batching {}
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 256, 256 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 3, 256, 256 ]
  }
]
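
For reference, a plan like the one above can be produced offline with trtexec inside the matching TensorRT/PyTorch container, along these lines (the shape flags are an assumption based on the config above, batch 1 to 4, and require a dynamic batch dimension in the ONNX export):

trtexec --onnx=model.onnx --saveEngine=model.trt \
    --minShapes=input:1x3x256x256 \
    --optShapes=input:4x3x256x256 \
    --maxShapes=input:4x3x256x256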

Unfortunately I cannot share any model, but if needed I can search for open-source models that have the same issue. Please let me know if there is any other information needed.

Expected behavior: The startup time for different Triton server versions is similar.

oandreeva-nv commented 1 year ago

@jnlarrain Thank you for reporting this issue. Would you be so kind as to provide us with open-source models that show similar issues, so we can reproduce? This would help us a lot and potentially speed up debugging.

jnlarrain commented 1 year ago

@oandreeva-nv thanks for your response, this open-source model architecture can be used to replicate the issue. You will need to run the step that produces the ONNX file. I hope this is useful.

oandreeva-nv commented 1 year ago

Thanks! We'll look into this.

oandreeva-nv commented 1 year ago

@jnlarrain Could you please share information on how you measured the start time? I would like to make sure I can reproduce the issue with the same steps

jnlarrain commented 1 year ago

@oandreeva-nv Sure, I measure the time using the health endpoint; here is pseudo-code in Python:

from time import time, sleep
from os import system
import requests

CONTAINER = "test-triton-server"   # container name and image are placeholders
IMAGE = "<triton_image_name>"      # the image built with compose.py

# Stop any previous instance, then start the server and time how long it
# takes until the readiness endpoint reports ready.
system(f"docker kill {CONTAINER}")
t = time()

system(
    f"docker run --rm -d --name {CONTAINER} --shm-size=1g --gpus device=0 "
    f"-p 8000-8002:8000-8002 {IMAGE}"
)

while True:
    sleep(1)
    try:
        # Break only once the server actually reports ready (HTTP 200).
        if requests.get("http://localhost:8000/v2/health/ready").status_code == 200:
            break
    except requests.exceptions.ConnectionError:
        continue

print(time() - t)
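
Note that with --strict-readiness true the /v2/health/ready endpoint only reports ready once the server and all models are ready, so checking the status code in the sketch above ties the measurement to model load time rather than just to the HTTP endpoint coming up.
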
jnlarrain commented 1 year ago

@oandreeva-nv just wanted to check how it is going; can I support with any logs or anything else?

oandreeva-nv commented 1 year ago

@jnlarrain Apologies for the long wait. I'll prioritize this issue this week and let you know in case I need something else

oandreeva-nv commented 1 year ago

Hi @jnlarrain, I ran some experiments with the model you linked. I'll start with the offline TRT model: I used the 23.05 TRT container to run trtexec to convert the ONNX model for use in the Triton 23.05 container, and I used the 22.07 TRT container to convert the ONNX model for the 22.07 Triton container (the 22.05 TRT container wasn't able to convert the ONNX model). In this setup, I couldn't reproduce the difference in start times; they both took around 4.5 sec on average across 10 runs.

Note: I measured the start time using time, i.e. time tritonserver --model-repository=..., stopping the server as soon as it is ready.

Regarding the ONNX + TRT optimization flag: starting with a baseline, i.e. a minimal config with no optimizations and no warmups, I saw similar start times, no drastic differences. Same with warmups added.

I did see startup differences when the TRT optimization flag is enabled in the config file: the 22.07 version was approximately 3 times faster.

I am CC'ing @pranavsharma, who hopefully has better insight into why later ONNX Runtime + TRT versions take longer to build.

jnlarrain commented 1 year ago

@oandreeva-nv thanks for your response. Did you manage to get the same results with the compose.py command shown in the issue? Could it be that I am missing a flag or something from the documentation?

oandreeva-nv commented 1 year ago

@jnlarrain I was using NGC server containers, but I can try compose.py. Meanwhile, you can also try optimizing the ONNX model with only basic optimizations, i.e. adding this to your ONNX config:

optimization {
  graph : {
    level : -1
  }
}

Note, there is a PR which will allow disabling all ONNX optimizations. That way you can optimize the ONNX model offline and load the optimized model with ONNX Runtime optimizations disabled, which should speed up the server's start time.
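
For the offline step, a minimal sketch of what saving a pre-optimized ONNX model with ONNX Runtime could look like (file names here are placeholders, not from the model above):

import onnxruntime as ort

sess_options = ort.SessionOptions()
# Apply graph optimizations offline; EXTENDED keeps the saved file free of the
# hardware-specific layout optimizations that ORT_ENABLE_ALL may apply.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# Serialize the optimized graph so it can be loaded as-is later.
sess_options.optimized_model_filepath = "model_optimized.onnx"
# Creating the session runs the optimizations and writes the file above.
ort.InferenceSession("model.onnx", sess_options)

The resulting model_optimized.onnx would then replace model.onnx in the model repository, combined with the optimization-disabling option from the PR mentioned above once it is available.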

jnlarrain commented 1 year ago

@oandreeva-nv sorry for the late response. With the optimization change you suggested, the startup is similar in both versions; the average of 5 runs, loading 5 models, is:

| version | seconds |
|---------|---------|
| 22.05   | 9.1     |
| 23.05   | 12.5    |

I think this difference is fine, but I still can't get the same startup numbers you got with TRT models and the compose.py container. As I said before, could I be missing a flag or something?

oandreeva-nv commented 1 year ago

@jnlarrain Thanks for the feedback. I didn't have a chance to test compose.py yet. I'll try to prioritize it next week.

jnlarrain commented 11 months ago

@oandreeva-nv any news with compose.py on your side, or on why on-the-fly TRT optimization for ONNX is taking longer in the new versions?

oandreeva-nv commented 11 months ago

Hi @jnlarrain, I believe there is nothing special in compose.py that could affect TRT optimization for ONNX. May I ask you to file an issue on the ONNX Runtime GitHub or the TensorRT issues page? Unfortunately, it doesn't seem to be a Triton-specific issue.

jnlarrain commented 11 months ago

@oandreeva-nv thanks for your reply. I will continue investigating this and post in the proper forum once I can narrow down the cause.