Open kangyeolk opened 1 year ago
Have you tried looking at your system metrics? When I tried reproducing this locally, my CPU utilization was ~100%. So it makes sense the improvement in throughput was marginal.
Higher batch sizes aren't guaranteed to have better performance, since it depends on your system capabilities and the model. That's why perf_analyzer and Model Analyzer exist: to help find the right configurations for your system and needs.
When using perf_analyzer, I've seen CPU usage hit 100%. However, even with a 1-layer model, all 64 cores given to the container would consistently sit at 100%, so I initially assumed ORT was simply reserving whatever resources it was given.
I'm currently trying various ORT options (such as OMP_WAIT_POLICY), but in every scenario the CPU usage stays pinned at 100%, and I don't understand why.
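In case it's useful, this is roughly how I'm setting those options on a standalone ORT session (the thread counts below are placeholders, not my exact values):

import os

# OMP_WAIT_POLICY only affects OpenMP-enabled ORT builds and has to be set
# before the runtime initializes, so set it before importing onnxruntime.
os.environ.setdefault("OMP_WAIT_POLICY", "PASSIVE")

import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 4   # threads used within a single operator
so.inter_op_num_threads = 1   # threads used across independent operators
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

sess = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
out = sess.run(None, {"input": np.ones((2, 3, 32, 32), dtype=np.float32)})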
Here is my shallow model:
import torch

# Deliberately shallow model: a single 3x3 convolution.
simple_model = torch.nn.Conv2d(3, 1, 3, 1, 1)
x = torch.ones((2, 3, 32, 32))  # N x C x H x W

torch.onnx.export(
    simple_model,
    x,
    'model.onnx',
    input_names=["input"],
    dynamic_axes={"input": {0: "batch", 2: "width"}},
)
Do you see the same CPU usage on ONNX Runtime directly? Generally, GPUs are the way to go for deep learning models, but it would be helpful to know whether the issue is happening in the framework (ORT) or in Triton.
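For a quick standalone check, something along these lines (reusing the model.onnx exported above), run while watching top/htop, should make it clear whether ORT alone pins the cores:

import numpy as np
import onnxruntime as ort

# Run the exported model in a tight loop so CPU utilization is easy to
# observe while the script is running.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
x = np.ones((2, 3, 32, 32), dtype=np.float32)

for _ in range(10000):
    sess.run(None, {"input": x})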
No, I've tested ORT without the Triton server and its CPU utilization looks normal. My guess is that perf_analyzer itself is consuming a lot of resources, given that only 4 cores were allocated to the container. Another experiment, where the container was assigned 32 cores and the model used only 4 of them (i.e., an instance_group count of 4), shows the expected behavior as max_batch_size increases:
In my opinion, to solve this we would have to bind separate sets of cores to perf_analyzer and to the inference task. Honestly, that doesn't seem like the most desirable approach. Any other thoughts?
I independently ran experiments without perf_analyzer (just sending inference requests repeatedly to the server), and I'm still seeing 100%+ CPU usage spikes.
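Roughly, the request loop I mean looks like this (the model and tensor names are just representative of my resnet50 setup, not exact):

import numpy as np
import tritonclient.http as httpclient

# Plain request loop against a local Triton HTTP endpoint; adjust the model
# name, input name, shape, and dtype to match the deployed config.
client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(4, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

for _ in range(1000):
    client.infer("resnet50-mbs4", inputs=[inp])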
Created a ticket (DLIS-5287) for us to investigate the high CPU usage.
CC: @matthewkotila for awareness.
Description
I'm trying multiple config.pbtxt configurations for serving models on local CPUs. Every other option works well, but I've found that max_batch_size does not behave as expected. What I expect is that as max_batch_size increases, the performance (e.g., RPS) improves, but the measured performances were similar.
Triton Information
23.05-py3
To Reproduce
Launch the Triton server.
Load the models via the API with the following configurations (a representative config sketch is shown after these steps):
For the model, I used a PyTorch pre-trained model (ResNet-50):
Run perf_analyzer against each configuration.
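For illustration, the kind of config.pbtxt I mean looks roughly like this; it is a hypothetical sketch of the mbs4 variant, the mbs1 variant differs only in its name and max_batch_size, and input/output tensors are omitted:

# Hypothetical sketch, not my exact file.
name: "resnet50-mbs4"
platform: "onnxruntime_onnx"  # assumption; use whichever backend you deploy with
max_batch_size: 4
instance_group [
  {
    kind: KIND_CPU
    count: 1
  }
]
# Dynamic batching is what lets the server form batches from individual
# requests, up to max_batch_size.
dynamic_batching { }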
After this process, I got similar performance: the maximum RPS of resnet50-mbs1 was 11.4 and the maximum RPS of resnet50-mbs4 was 11.6.
Expected behavior
In my understanding, when max_batch_size increases, the performance should be much better even when using CPUs.