Open salimmj opened 4 years ago
@fs-eire to help run the notebook on a MacBook to see whether it reproduces the issue.
Additional note: ONNX performs slightly better than PyTorch (by ~5-10%) when `intra_num_threads` is unspecified.
@salimmj, thanks for reporting the issue.
@fs-eire reproduced the issue on one Mac notebook. It seems that onnxruntime is not built with OpenMP (this looks like an issue in the Mac build pipeline).
The best setting for BERT on onnxruntime 1.3.0 and 1.4.0 for Mac: leave `intra_num_threads` unspecified (or set it to the number of physical/logical cores), and there is no need to set `OMP_NUM_THREAD`.
I'll update the notebook so that it also produces the expected results on Mac.
Thanks for looking into this. I also got better benchmarking results with `intra_num_threads` unspecified. `OMP_NUM_THREAD=1` is desirable when optimizing for overall throughput with multiple gRPC workers. I will test this on EC2 with `intra_num_threads` unspecified to confirm that the issue only occurs on Mac, and will comment with the results.
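As a sketch of that throughput-oriented setup: OpenMP reads its environment when the library loads, so the variable must be set in each worker before onnxruntime is imported. (Note that the standard OpenMP variable is spelled `OMP_NUM_THREADS`.)

```python
import os

# Pin each worker process to a single OpenMP thread so that N gRPC
# workers don't oversubscribe the cores; this must run before
# onnxruntime is imported.
os.environ["OMP_NUM_THREADS"] = "1"

# import onnxruntime as ort  # imported only after the variable is set
```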
Merged PR #4774. However, this change will not apply to the published packages; a local build is needed to pick up the change, or wait for the next release.
Could anyone provide some details on how OpenMP improves performance vs. the thread pools that are used by default? I'm curious because the Java API doesn't have OpenMP enabled in the native build. Does it make sense to use OpenMP to improve throughput, or should tuning the thread pools with intra/inter threads be enough to achieve similar numbers?
@slevental, the internal thread pools achieve similar performance on most models we tested. OpenMP is slightly better on transformer models (like BERT), probably because we have tuned the BERT optimizations for OpenMP.
What is the conclusion?
> Could anyone provide some details on how OpenMP improves performance vs threadpools that are used by default? I'm curious because Java API doesn't have openmp enabled in the native build, does it make sense to use OpenMP to improve throughput or tuning threadpools with intra/inter threads should be enought to achieve similar numbers?
What's wrong with https://github.com/microsoft/onnxruntime/issues/6031 ? Thank you!
Describe the bug
I ran this notebook on my machine and I cannot replicate the performance improvement results.
System information: MacBook Pro
To Reproduce
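The reproduction steps aren't included above; as a hedged sketch, the timings below are typically produced by a loop of this kind, where the `run_fn` callable standing in for a session's `run` call is an assumption, not the notebook's actual code:

```python
import time

def benchmark_ms(run_fn, n=100):
    """Average wall-clock latency of run_fn in milliseconds over n calls."""
    run_fn()  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(n):
        run_fn()
    return (time.perf_counter() - start) / n * 1000.0

# Example: benchmark_ms(lambda: sess.run(None, inputs)) for an ONNX session.
```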
Expected behavior
I expected results similar to the ones cached in the notebook:
- 198.99 ms
- 176.96 ms
- 84.73 ms (with `OMP_NUM_THREAD = 12`, `intra_num_threads = 1`)
- 306.75 ms (with `OMP_NUM_THREAD = 1`, `intra_num_threads = 1`)

But on my Mac I got:
- 106.82 ms
- 265.84 ms
- 234.48 ms (with `OMP_NUM_THREAD = 12`, `intra_num_threads = 1`)
- 273.45 ms (with `OMP_NUM_THREAD = 1`, `intra_num_threads = 1`)

And on EC2:
- 137.08 ms
- 189.88 ms
- 175.70 ms (with `OMP_NUM_THREAD = 4`, `intra_num_threads = 1`)
- 267.25 ms (with `OMP_NUM_THREAD = 1`, `intra_num_threads = 1`)

Thanks in advance!
cc: @tianleiwu