triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Worse performance with higher max_batch_size? #6141

Open kangyeolk opened 1 year ago

kangyeolk commented 1 year ago

Description: I'm trying multiple config.pbtxt files for serving models on local CPUs. Every other option works as expected, but I've found that max_batch_size does not behave the way I expect.

What I expect is that as max_batch_size increases, performance (e.g., RPS) would improve. But the measured performance was similar across configurations.

Triton Information: 23.05-py3

To Reproduce

  1. Launch triton server

    docker run -d --cpuset-cpus=0-3 --memory 8g --rm -p8000:8000 -p8001:8001 -p8002:8002 -v path//to//model:/models nvcr.io/nvidia/tritonserver:23.05-py3 tritonserver --model-control-mode=explicit --model-repository=/models
  2. Load the models via the API with the following configurations

For the model, I used a PyTorch pre-trained model:

import torch
from torchvision.models import resnet50, ResNet50_Weights

# Old weights with accuracy 76.130%
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

x = torch.ones((32, 3, 224, 224))  # N x C x H x W

torch.onnx.export(
    model,
    x,
    'model.onnx',
    input_names=["input"],
    dynamic_axes={"input": {0: "batch", 2: "width"}}
)
config.pbtxt for resnet50-mbs1:

name: "resnet50-mbs1"
platform: "onnxruntime_onnx"
max_batch_size : 1

dynamic_batching {
    max_queue_delay_microseconds: 100
}

input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 3, 224, 224 ] }
  }
]

version_policy: { latest: { num_versions: 1}}

instance_group [
  {
    count: 1
    name: "cpu_instance"
    kind: KIND_CPU
  }
]

parameters { key: "intra_op_thread_count" value: { string_value: "1" } }
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
config.pbtxt for resnet50-mbs4:

name: "resnet50-mbs4"
platform: "onnxruntime_onnx"
max_batch_size : 4

dynamic_batching {
    max_queue_delay_microseconds: 100
}

input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 3, 224, 224 ] }
  }
]

version_policy: { latest: { num_versions: 1}}

instance_group [
  {
    count: 1
    name: "cpu_instance"
    kind: KIND_CPU
  }
]

parameters { key: "intra_op_thread_count" value: { string_value: "1" } }
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
  3. Test the performance with perf_analyzer:
perf_analyzer -m resnet50-mbs1 --shape input:3,224,224 --concurrency-range 1:40:3 -f ./logs/cpus4_mbs1_thr1_igc1.csv --measurement-interval=5000

perf_analyzer -m resnet50-mbs4 --shape input:3,224,224 --concurrency-range 1:40:3 -f ./logs/cpus4_mbs1_thr1_igc1.csv --measurement-interval=5000

After this process, I got similar performance from both models: a maximum RPS of 11.4 for resnet50-mbs1 and 11.6 for resnet50-mbs4.

Expected behavior: In my understanding, as max_batch_size increases, throughput should improve noticeably, even when running on CPUs.
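
For comparison, perf_analyzer can also send batched requests itself via the -b flag (per-request batch size), so the mbs4 model would receive batch-4 requests directly rather than relying only on dynamic batching to form them. A sketch, reusing the model name and options from above:

perf_analyzer -m resnet50-mbs4 -b 4 --shape input:3,224,224 --concurrency-range 1:40:3 --measurement-interval=5000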

dyastremsky commented 1 year ago

Have you tried looking at your system metrics? When I tried reproducing this locally, my CPU utilization was ~100%, so it makes sense that the improvement in throughput was marginal.

Higher batch sizes aren't guaranteed to have better performance, since it depends on your system capabilities and the model. That's why perf_analyzer and Model Analyzer exist: to help find the right configurations for your system and needs.
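
For reference, a minimal Model Analyzer sweep over one of the models above might look like the sketch below; the output path is an arbitrary assumption:

model-analyzer profile --model-repository /models --profile-models resnet50-mbs4 --output-model-repository-path /tmp/model-analyzer-output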

kangyeolk commented 1 year ago

When using perf_analyzer, I've seen the CPU usage hit 100%. However, I noticed that even with a 1-layer model, when all 64 cores were available to the container, all 64 cores would consistently reach 100%. So I initially thought the server was simply reserving whatever resources it was given in order to operate.

[Screenshot, 2023-07-27 9:28 PM: CPU usage across cores]

I'm currently trying various ORT options (like OMP_WAIT_POLICY), but in every scenario the CPU usage remains pinned at 100% (I wonder why).
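
For illustration, one hedged way to try such a setting is to pass it into the container as an environment variable on the original docker run command, e.g. assuming OMP_WAIT_POLICY=PASSIVE:

docker run -d --cpuset-cpus=0-3 --memory 8g --rm -e OMP_WAIT_POLICY=PASSIVE -p8000:8000 -p8001:8001 -p8002:8002 -v path//to//model:/models nvcr.io/nvidia/tritonserver:23.05-py3 tritonserver --model-control-mode=explicit --model-repository=/models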

Here is my shallow model:

import torch

simple_model = torch.nn.Conv2d(3, 1, 3, 1, 1)
x = torch.ones((2, 3, 32, 32))  # N x C x H x W

torch.onnx.export(
    simple_model,
    x,
    'model.onnx',
    input_names=["input"],
    dynamic_axes={"input": {0: "batch", 2: "width"}}
)
dyastremsky commented 1 year ago

Do you see the same CPU usage on ONNX Runtime directly? Generally, GPUs are the way to go for deep learning models, but it would be helpful to know whether the issue is happening in the framework (ORT) or in Triton.
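
A minimal sketch of such a standalone ONNX Runtime check, assuming the model.onnx exported above and thread settings mirroring the Triton config (one intra-op and one inter-op thread):

import numpy as np
import onnxruntime as ort

# Mirror the thread settings from the Triton config above.
opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1

sess = ort.InferenceSession("model.onnx", sess_options=opts,
                            providers=["CPUExecutionProvider"])

# Repeat a batch-1 request to observe steady-state CPU usage.
x = np.ones((1, 3, 224, 224), dtype=np.float32)
for _ in range(100):
    sess.run(None, {"input": x})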

kangyeolk commented 1 year ago

No, I've tested ORT without the Triton server, and it seems to have normal CPU utilization. My guess is that perf_analyzer itself is consuming a lot of resources, given that only 4 cores were allocated to the container. Another experiment, where the container was assigned 32 cores and the model used only 4 of them (i.e., an instance_group count of 4), shows the expected behavior as max_batch_size increases:

[Image: throughput results for the 32-core container experiment]

In my opinion, to work around this we would have to pin perf_analyzer and the inference server to separate sets of cores. Honestly, that doesn't seem like the most desirable approach. Any other thoughts?
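
A rough sketch of that kind of separation, assuming perf_analyzer runs on the host while the server keeps its original cores 0-3; the 4-7 core range is an arbitrary choice:

# Server stays pinned to cores 0-3 via the original docker run --cpuset-cpus=0-3.
# Pin perf_analyzer to a disjoint set of cores:
taskset -c 4-7 perf_analyzer -m resnet50-mbs4 --shape input:3,224,224 --concurrency-range 1:40:3 --measurement-interval=5000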

dyastremsky commented 1 year ago

I independently ran experiments without perf_analyzer (just sending inference requests repeatedly to the server), and I'm still seeing 100%+ CPU usage spikes.
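
For context, a direct request loop of that kind might look like the sketch below, using the Triton Python HTTP client with the model and input names assumed from the configs above:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Batch-1 FP32 input matching the config dims [3, 224, 224] plus the batch dimension.
data = np.ones((1, 3, 224, 224), dtype=np.float32)
inp = httpclient.InferInput("input", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(data)

# Send requests repeatedly and watch server-side CPU usage.
for _ in range(1000):
    client.infer("resnet50-mbs1", inputs=[inp])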

Created a ticket (DLIS-5287) for us to investigate the high CPU usage.

CC: @matthewkotila for awareness.