triton-inference-server / onnxruntime_backend

The Triton backend for the ONNX Runtime.

CPU inference is much slower than with ONNX Runtime directly #34

Open artmatsak opened 3 years ago

artmatsak commented 3 years ago

Description Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model run through Triton r21.02 takes 1000+ ms on average. We've also tried with Triton r20.09, same result.
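For reference, a rough sketch of the kind of direct-ORT CPU baseline described above (the model path, input names, and shapes are placeholders, not the actual Electra model):

```python
# Hypothetical timing sketch for the "ONNX Runtime directly" baseline.
# "model.onnx", the input names, and the (1, 128) shapes are placeholders.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")

inputs = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

sess.run(None, inputs)  # warm-up run
start = time.perf_counter()
for _ in range(20):
    sess.run(None, inputs)
print(f"avg latency: {(time.perf_counter() - start) / 20 * 1000:.1f} ms")
```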

Triton Information 21.02

Are you using the Triton container or did you build it yourself? Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.

To Reproduce

I cannot share the full model but it's a PyTorch Transformer-based model exported from HuggingFace to ONNX.

Expected behavior The inference time on CPU in Triton should be about the same as in ONNX Runtime directly.

askhade commented 3 years ago

Will investigate this and report back on my findings

jcwchen commented 3 years ago

Hi @artmatsak, Here are my experimental results for the CPU performance issue with a BERT model (bert-squad):

| Approach | Latency |
| --- | --- |
| Standalone ORT perf_test, GPU | 15.12 ms |
| Triton r21.08, GPU | 13.7 ms |
| Standalone ORT perf_test, CPU | 223.168 ms |
| Triton r21.08, CPU | 666 ms |
| Triton r21.08, CPU (remove thread=1) | 227 ms |

Previously, the Triton ORT backend always set the number of threads used to parallelize execution to 1. This was fixed recently in https://github.com/triton-inference-server/onnxruntime_backend/pull/67, and the fix is included in the 21.09 release. As you can see, removing thread=1 brings Triton's CPU performance at least close to standalone ORT on CPU.
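For anyone on 21.09 or later who wants to control this explicitly, the backend exposes thread-count parameters in config.pbtxt; a minimal sketch, assuming the README's parameter names (a value of 0 lets ORT pick a default):

```
parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
```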

However, you hit this issue with Triton 21.02, and at that time the Triton ORT backend still used OpenMP, so the thread number setting had no effect. Things might be quite different now (it no longer uses OpenMP). Could you please check whether the latest 21.09 resolves the CPU performance issue with your BERT model? Let's confirm whether this issue has been resolved in the latest build. Thank you!
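If it helps, launching the 21.09 container for a quick comparison looks roughly like this (the model repository path is a placeholder):

```bash
docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:21.09-py3 \
  tritonserver --model-repository=/models
```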

johnsGuo commented 3 years ago

I use Triton ORT 21.09-py3 and have the same problem. When I run the perf test, CPU utilization is around 100%, but QPS does not improve as much as expected.
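For measuring QPS against Triton, one option is perf_analyzer from the Triton SDK container; a rough invocation (model name, endpoint, and concurrency range are placeholders to adjust):

```bash
perf_analyzer -m my_model -u localhost:8001 -i grpc \
  --concurrency-range 1:8 --percentile=95
```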

askhade commented 2 years ago

@GuoGuiRong Are you comparing ORT with Triton-ORT? Can you add more details regarding

  1. the ORT and Triton-ORT configs used during testing
  2. what perf diff you are seeing

bezdomniy commented 2 years ago

Hi, I'm getting the same issue: running ORT directly is about 3x faster. I am using the HuggingFace transformers.onnx library to convert the model to ONNX, and I run it using the onnxruntime Python library.
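(For context, the export step described above is roughly the following; the exact model id and CLI flags depend on the transformers version:)

```bash
python -m transformers.onnx --model=sentence-transformers/paraphrase-MiniLM-L6-v2 onnx/
```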

The Triton model config is this:

```
name: "paraphrase-MiniLM-L6-v2"
platform: "onnxruntime_onnx"
max_batch_size: 0

input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [ -1, -1 ] },
  { name: "token_type_ids" data_type: TYPE_INT64 dims: [ -1, -1 ] },
  { name: "attention_mask" data_type: TYPE_INT64 dims: [ -1, -1 ] }
]

output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [ -1, -1, -1 ]
}
```

I have tried the various optimisation parameters suggested in the backend repo, but these seem to make the performance worse.
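For anyone else landing here, these are the kinds of settings meant (drawn from the backend README as I understand it; whether they help is model-dependent, and as noted above they can also hurt):

```
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [ { name : "openvino" } ]
  }
}
parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
```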

farzanehnakhaee70 commented 2 years ago

Is there any update for this issue?

farzanehnakhaee70 commented 2 years ago

> Hi @artmatsak, Here are my experimental results for the CPU performance issue with a BERT model (bert-squad):
>
> | Approach | Latency |
> | --- | --- |
> | Standalone ORT perf_test, GPU | 15.12 ms |
> | Triton r21.08, GPU | 13.7 ms |
> | Standalone ORT perf_test, CPU | 223.168 ms |
> | Triton r21.08, CPU | 666 ms |
> | Triton r21.08, CPU (remove thread=1) | 227 ms |
>
> [...] As you can see, removing thread=1 brings Triton's CPU performance at least close to standalone ORT on CPU. [...]

Thanks for sharing your experiences. In your table, there is a roughly 2 ms difference in GPU inference time between Triton and ORT run directly. Does anyone know why this difference exists?

hanswang1 commented 3 months ago

Hi, I found that this still exists in Triton version 22.10. Does anyone have an idea for a solution or workaround?

Mitix-EPI commented 3 months ago

Maybe related https://github.com/triton-inference-server/onnxruntime_backend/issues/265#issue-2473275334 ?

hanswang1 commented 3 months ago

> Maybe related #265 (comment)?

Can anyone help with this issue?