microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ONNX Runtime performing worse than PyTorch on BERT CPU Inference #4654

Open salimmj opened 4 years ago

salimmj commented 4 years ago

Describe the bug

I ran this notebook on my machine and I cannot replicate the performance-improvement results.

System information

MacBook Pro:

To Reproduce

Expected behavior

I expected results similar to the ones cached in the notebook:

But on my Mac I got:

And on EC2:

Thanks in advance!

cc: @tianleiwu

tianleiwu commented 4 years ago

@fs-eire will help run the notebook on a MacBook to see whether the issue can be reproduced.

salimmj commented 4 years ago

Additional note: ONNX Runtime performs slightly better than PyTorch (by ~5-10%) when intra_num_threads is left unspecified.

tianleiwu commented 4 years ago

@salimmj, thanks for reporting the issue.

@fs-eire reproduced the issue on one Mac notebook. It seems that onnxruntime is not built with OpenMP there (this looks like an issue with the Mac build pipeline).

The best setting for BERT on onnxruntime 1.3.0 and 1.4.0 on Mac: leave intra_num_threads unspecified (or set it to the number of physical/logical cores), and there is no need to set OMP_NUM_THREADS.
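For reference, that configuration with the standard Python API looks roughly like this (a minimal sketch, not the notebook's code; `bert.onnx` is a placeholder model path, and the session option is named `intra_op_num_threads` in the Python API):

```python
import onnxruntime as ort

# Recommended on Mac: leave the intra-op thread count unspecified and do not
# export OMP_NUM_THREADS; onnxruntime then picks a default based on the cores.
sess = ort.InferenceSession("bert.onnx")

# Alternatively, pin it explicitly to the physical/logical core count:
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # e.g. 4 physical cores on this MacBook Pro
sess_pinned = ort.InferenceSession("bert.onnx", sess_options=opts)
```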

I'll update the notebook so that it also produces the expected results on Mac.

salimmj commented 4 years ago

Thanks for looking into this. I also got better benchmark results with intra_num_threads left unspecified.

OMP_NUM_THREADS=1 is desirable when optimizing for overall throughput with multiple gRPC workers.
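For context, the multi-worker setup I mean is roughly this (a sketch under my assumptions, not our production code; in OpenMP-enabled builds the variable has to be set before onnxruntime is first imported):

```python
import os

# Pin each gRPC worker process to a single OpenMP thread so that N workers
# on an N-core host do not oversubscribe the CPU. Must happen before the
# first onnxruntime import in OpenMP-enabled builds.
os.environ["OMP_NUM_THREADS"] = "1"

import onnxruntime as ort

sess = ort.InferenceSession("bert.onnx")  # placeholder model path
```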

I will test this on EC2 with intra_num_threads unspecified to confirm that the issue is Mac-only, and I'll post the results here.

fs-eire commented 4 years ago

Merged PR #4774. However, this change will not apply to the already-published packages; a local build is needed to pick up the change, or wait for the next release.

slevental commented 3 years ago

Could anyone provide some details on how OpenMP improves performance compared with the thread pools used by default? I'm curious because the Java API doesn't have OpenMP enabled in its native build. Does it make sense to use OpenMP to improve throughput, or should tuning the thread pools via the intra/inter op thread counts be enough to achieve similar numbers?

tianleiwu commented 3 years ago

@slevental, the internal thread pools achieve similar performance on most models we tested. OpenMP is slightly better on transformer models (like BERT), probably because we tuned the BERT optimizations for OpenMP.
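Without OpenMP, the tuning knobs are the intra/inter op thread counts on SessionOptions. A minimal Python sketch (I'm assuming the Java API exposes equivalent setters such as setIntraOpNumThreads; `bert.onnx` is a placeholder model):

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # threads that parallelize work inside one op
opts.inter_op_num_threads = 1  # threads that run independent ops concurrently
# inter_op_num_threads only takes effect in ORT_PARALLEL execution mode:
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

sess = ort.InferenceSession("bert.onnx", sess_options=opts)
```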

guotong1988 commented 3 years ago

What is the conclusion?

guotong1988 commented 3 years ago

> Could anyone provide some details on how OpenMP improves performance compared with the thread pools used by default? I'm curious because the Java API doesn't have OpenMP enabled in its native build. Does it make sense to use OpenMP to improve throughput, or should tuning the thread pools via the intra/inter op thread counts be enough to achieve similar numbers?

What's wrong with https://github.com/microsoft/onnxruntime/issues/6031? Thank you!