triton-inference-server / onnxruntime_backend

The Triton backend for the ONNX Runtime.
BSD 3-Clause "New" or "Revised" License

CPU Throttling when Deploying Triton with ONNX Backend on Kubernetes #245

Open langong347 opened 6 months ago

langong347 commented 6 months ago

Description I am deploying a YOLOv8 model for object detection using Triton with the ONNX backend on Kubernetes. I have experienced significant CPU throttling in the sidecar container ("queue-proxy"), which sits in the same pod as the main container running the Triton server.

CPU throttling is triggered by a growing number of threads in the sidecar as traffic increases.

Meanwhile, there are 70-100 threads (for 1 or 2 model instances) inside the main container (with Triton) as soon as the service is up, much higher than the typical number of threads without Triton.

Allocating more CPU resources to the sidecar doesn't seem to be effective, which suggests potential resource competition between the main (Triton) container and the sidecar, given the significantly high number of threads spun up by Triton.

My questions are:

  1. What causes Triton with the ONNX backend to spin up so many threads? Is it due to the ONNX backend's intra_op_thread_count, which by default picks up all available CPU cores? Are all the threads active?
  2. Since my ONNX model runs on GPU, will using a lower intra_op_thread_count hurt performance?

[Screenshot: sidecar CPU throttling, recurred over several deployments]

[Screenshot: container threads. Upper: main container with Triton; lower: sidecar]

Triton Information 23.12

Are you using the Triton container or did you build it yourself? Through KServe 0.11.2

To Reproduce Deployed the YOLOv8 model using KServe. The ONNX part of the model is executed by Triton as a runtime hosted in a separate pod. Inter-pod communication is handled over gRPC, sending preprocessed images to the pod containing Triton for prediction.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well). YOLOv8 model written in PyTorch and exported to ONNX.

My config.pbtxt

name: "my-service-name"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
dynamic_batching {
    max_queue_delay_microseconds: 20000
}
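
For context, this config has no instance_group, so Triton applies its default instance placement (typically one instance per available GPU). An explicit GPU placement would look like the sketch below, where count: 1 is illustrative rather than my actual setting:

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]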

Expected behavior Sidecar CPU throttling should disappear after reducing the number of threads spun up by Triton in the main container.

fpetrini15 commented 6 months ago

Hi @langong347,

Thank you for submitting an issue.

I notice your config does not set a value for intra_op_thread_count, so yes, I believe the number of threads corresponds directly to the number of CPU cores. Whether these threads are active or have an impact on performance depends on the workload.

Have you tried setting a value for intra_op_thread_count that is less than the number of CPU cores and checking its impact on performance / CPU throttling in your environment?
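
For example, one way to pass a lower value when launching Triton directly (a sketch; 4 is an arbitrary value and <model_repo> is a placeholder):

tritonserver --model-repository=<model_repo> \
    --backend-config=onnxruntime,intra_op_thread_count=4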

CC @whoisj for more thoughts.

langong347 commented 6 months ago

Hi @fpetrini15, thank you for your reply! I have the follow-up questions below:

  1. I have tried explicitly setting intra_op_thread_count = <cpu_limit_container> (the maximum number of CPU cores allowed for the main container running Triton). The decision to use cpu_limit_container is to make sure ONNX only utilizes the CPUs available to the container instead of all CPUs of the node hosting it, based on this similar issue: High CPU throttling when running torchscript inference with triton on high number cores node.

    a. The thread count of my main container does fall from 70 to 40 (the exact number might be related to the exact value of intra_op_thread_count) with one model instance.

    b. However, CPU throttling of my main container has surged, and GPU utilization has fallen from 100% to 0%.

[Screenshots: main container CPU throttling and GPU utilization]

Q: Is intra_op_thread_count only relevant when the model needs to be executed entirely on CPU?

From the ONNX Runtime documentation: "For the default CPU execution provider, setting defaults are provided to get fast inference performance."

This is how I configured intra_op_thread_count through the KServe specification.

# KServe runs Triton inside a container, so below is the way to access the tritonserver command
containers:
  - command:
      - tritonserver
    args:
      - '--allow-http=true'
      - '--allow-grpc=true'
      - '--grpc-port=9000'
      - '--http-port=8080'
      - '--model-repository=<model_local_dir>'
      - '--log-verbose=2'
      - '--backend-config=onnxruntime,enable-global-threadpool=1'
      - '--backend-config=onnxruntime,intra_op_thread_count=<cpu_limit_container>'
      - '--backend-config=onnxruntime,inter_op_thread_count=<cpu_limit_container>'
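
For completeness, <cpu_limit_container> mirrors the container's CPU limit from the standard Kubernetes resources block; a sketch with illustrative values (not my actual limits):

    resources:
      limits:
        cpu: '8'        # the value referenced as <cpu_limit_container>; enforced via CFS quota, so exceeding it causes throttling
        memory: 16Gi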
  2. I wonder whether the 70-100 threads spun up in the main container (with Triton) are some sort of default thread pool used by Triton regardless of the backend?
  3. Should we configure multithreading options of backends via config.pbtxt? Is it the recommended way? Could you point me to the documentation on how to do so?

langong347 commented 5 months ago

Update: The issue with sidecar CPU throttling has been resolved by increasing the CPU and memory allocated to the sidecar container, which were undersized for the large image tensor inputs. However, I am still confused by the worsening performance after explicitly setting the ONNX thread count equal to the container's CPU cores. Please let me know if you have any insights into the latter.

fpetrini15 commented 5 months ago

@langong347,

Doing some testing:

  2. I wonder whether the 70-100 threads spun up in the main container (with Triton) are some sort of default thread pool used by Triton regardless of the backend?

When starting Triton in explicit mode, loading no models, Triton consistently spawned 32 threads. When loading a simple onnx model, Triton consistently spawned 50 threads. When loading two instances of a simple onnx model, Triton consistently spawned 68 threads.
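
For anyone reproducing this, a quick hypothetical way to count a running tritonserver's threads from inside the container (<pid> is a placeholder for the tritonserver process ID):

grep Threads /proc/<pid>/status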

It is difficult to determine where each exact thread might be coming from in your case; however, the number of threads you are reporting seems within reason for what is normally spawned, especially given your setup is more complicated than my testing environment.

  3. Should we configure multithreading options of backends via config.pbtxt? Is it the recommended way? Could you point me to the documentation on how to do so?

I believe you are referring to my earlier statement regarding adding intra_op_thread_count to your config.pbtxt. I made that statement before I knew you were adding it as a parameter to your launch command. Both approaches are fine.
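
For reference, the config.pbtxt form uses the backend parameters mechanism; a sketch (the value 4 is arbitrary, and 0 lets ONNX Runtime pick its default):

parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "4" } }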

CC: @tanmayv25 do you have any thoughts regarding why setting a limit on CPU cores would diminish GPU usage?

langong347 commented 5 months ago

Removed "sidecar" from the issue title as it is a separate issue. The open issue is CPU throttling with the main container after configuring ONNX op thread count.

whoisj commented 3 months ago

The only other relevant option I can think of might be --model-load-thread-count. Try setting it to something like 4.
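
In the KServe spec above, that would be one more entry in args (a sketch):

      - '--model-load-thread-count=4'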

@langong347 , do you know how many logical CPU cores the node has? Asking because Triton can see the base hardware even though it is in a container, and that could be affecting the number of threads spawned. If there's any alignment between the number of threads and the node's core count, then we at least have a place to start looking for a solution. Thanks.
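
A hypothetical way to check what the container actually sees (nproc usually reports the host's logical cores, while the cgroup file reflects the container's quota; the container name is a placeholder, and the path assumes cgroup v2, on cgroup v1 it would be /sys/fs/cgroup/cpu/cpu.cfs_quota_us):

kubectl exec <pod-name> -c <triton-container> -- nproc
kubectl exec <pod-name> -c <triton-container> -- cat /sys/fs/cgroup/cpu.max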