Open artmatsak opened 3 years ago
Will investigate this and report back on my findings
Hi @artmatsak, here are the results of my experiment on the CPU performance issue with a certain BERT model (bert-squad):
| Approach | Latency |
|---|---|
| Standalone ORT perf_test GPU | 15.12 ms |
| Triton r21.08 GPU | 13.7 ms |
| Standalone ORT perf_test CPU | 223.168 ms |
| Triton r21.08 CPU | 666 ms |
| Triton r21.08 CPU (remove thread=1) | 227 ms |
Previously, the Triton ORT backend always set the number of threads used to parallelize execution to 1. This was fixed recently in https://github.com/triton-inference-server/onnxruntime_backend/pull/67, and the fix is included in the recent 21.09 release. As you can see, removing thread=1 brings the CPU performance on Triton at least close to that of standalone ORT on CPU.
However, you hit this issue with Triton 21.02, and at that time the Triton ORT backend was still using OpenMP, so the thread number did not take effect. Things might be quite different now (it no longer uses OpenMP). Could you please check whether the latest 21.09 release resolves the CPU performance issue with your BERT model? Let's confirm whether this issue has been fixed in the latest build. Thank you!
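To make the effect of that fix more concrete, here is a minimal sketch using the standalone ONNX Runtime Python API (not the Triton backend itself). It times the same session with intra-op threading pinned to 1 versus ORT's default, which is roughly what the removed thread=1 corresponds to. The model path, input names, and shapes are placeholders for a BERT-style model and need to be adjusted; the exact backend parameter keys for controlling thread counts from the Triton model config should be checked against the onnxruntime_backend README.

```python
import time

import numpy as np
import onnxruntime as ort

MODEL_PATH = "model.onnx"  # placeholder: any CPU-bound transformer ONNX model


def time_session(intra_op_threads: int) -> float:
    opts = ort.SessionOptions()
    # 0 lets ORT choose a thread count for the machine; 1 mimics the old
    # behavior of the Triton ORT backend before PR #67.
    opts.intra_op_num_threads = intra_op_threads
    sess = ort.InferenceSession(MODEL_PATH, opts, providers=["CPUExecutionProvider"])

    # Dummy BERT-style inputs; adjust names and shapes to your model.
    feed = {
        "input_ids": np.ones((1, 128), dtype=np.int64),
        "attention_mask": np.ones((1, 128), dtype=np.int64),
        "token_type_ids": np.zeros((1, 128), dtype=np.int64),
    }

    sess.run(None, feed)  # warm-up
    start = time.perf_counter()
    for _ in range(20):
        sess.run(None, feed)
    return (time.perf_counter() - start) / 20 * 1000  # ms per inference


print("intra_op_num_threads=1      :", time_session(1), "ms")
print("intra_op_num_threads=default:", time_session(0), "ms")
```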
I'm using Triton ORT 21.09-py3 and I have the same problem. When I run perf, CPU usage is about 100%, but QPS does not improve as expected.
@GuoGuiRong Are you comparing ORT with Triton-ORT? Can you add more details regarding your setup and how you are measuring QPS?
Hi, I'm getting the same issue; running ORT directly is about 3x faster. I'm using the HuggingFace transformers.onnx library to convert the model to ONNX and running it with the onnxruntime Python library.
For the Triton model config I have this:

```
name: "paraphrase-MiniLM-L6-v2"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]
output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [ -1, -1, -1 ]
}
```
I have tried the various optimisation parameters suggested in the backend repo, but these seem to make the performance worse.
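For what it's worth, here is a rough sketch of how the Triton side of that comparison could be measured against the config above, using the tritonclient Python package. The sequence length, request count, server URL, and dummy token IDs are arbitrary placeholders; for a fair comparison with a standalone onnxruntime run, the same tokenized inputs should be fed to both.

```python
import time

import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "paraphrase-MiniLM-L6-v2"
SEQ_LEN = 32  # placeholder sequence length

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the three inputs declared in the model config, filled with dummy token IDs.
inputs = []
for name in ("input_ids", "token_type_ids", "attention_mask"):
    inp = httpclient.InferInput(name, [1, SEQ_LEN], "INT64")
    inp.set_data_from_numpy(np.ones((1, SEQ_LEN), dtype=np.int64))
    inputs.append(inp)

outputs = [httpclient.InferRequestedOutput("last_hidden_state")]

client.infer(MODEL_NAME, inputs, outputs=outputs)  # warm-up
start = time.perf_counter()
for _ in range(100):
    client.infer(MODEL_NAME, inputs, outputs=outputs)
print("mean Triton latency:", (time.perf_counter() - start) / 100 * 1000, "ms")
```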
Is there any update for this issue?
> Hi @artmatsak, here are the results of my experiment on the CPU performance issue with a certain BERT model (bert-squad): […]
Thanks for sharing your experience. In your table, there is about a 2 ms difference between the GPU inference time with Triton and with ORT run directly. Does anybody know why this difference exists?
Hi, I found that this issue still exists in Triton 22.10. Does anyone have an idea for a solution or workaround?
Maybe related #265 (comment) ?
Can anyone help with this issue?
**Description**
Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model run through Triton r21.02 takes 1000+ ms on average. We've also tried with Triton r20.09, same result.

**Triton Information**
21.02

Are you using the Triton container or did you build it yourself? Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.

**To Reproduce**
I cannot share the full model but it's a PyTorch Transformer-based model exported from HuggingFace to ONNX.

**Expected behavior**
The inference time on CPU in Triton should be about the same as in ONNX Runtime directly.