microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ONNX Runtime performing worse than PyTorch on BERT CPU Inference #4654

Open salimmj opened 4 years ago

salimmj commented 4 years ago

Describe the bug

I ran this notebook on my machine and I cannot replicate the performance-improvement results.

System information

MacBook Pro:

To Reproduce

Expected behavior

I expected results similar to the ones cached in the notebook:

But on my Mac I got:

And on EC2:

Thanks in advance!

cc: @tianleiwu

tianleiwu commented 4 years ago

@fs-eire will help run the notebook on a MacBook to see whether the issue can be reproduced.

salimmj commented 4 years ago

Additional note: ONNX Runtime performs slightly better than PyTorch (by ~5-10%) when intra_num_threads is left unspecified.

tianleiwu commented 4 years ago

@salimmj, thanks for reporting the issue.

@fs-eire reproduced the issue on one Mac notebook. It seems that onnxruntime is not built with OpenMP there (this looks like an issue with the Mac build pipeline).

The best setting for BERT on onnxruntime 1.3.0 and 1.4.0 on Mac: leave intra_num_threads unspecified (or set it to the number of physical/logical cores), and there is no need to set OMP_NUM_THREADS.
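For reference, that configuration with the standard Python API looks roughly like this (a minimal sketch, not the notebook's code; `bert.onnx` is a placeholder model path, and the session option is named `intra_op_num_threads` in the Python API):

```python
import onnxruntime as ort

# Recommended on Mac: leave the intra-op thread count unspecified and do not
# export OMP_NUM_THREADS; onnxruntime then picks a default based on the cores.
sess = ort.InferenceSession("bert.onnx")

# Alternatively, pin it explicitly to the physical/logical core count:
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # e.g. 4 physical cores on this MacBook Pro
sess_pinned = ort.InferenceSession("bert.onnx", sess_options=opts)
```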

I'll update the notebook so that it also produces the expected results on Mac.

salimmj commented 4 years ago

Thanks for looking into this. I also got better benchmark results with intra_num_threads left unspecified.

OMP_NUM_THREADS=1 is desirable when optimizing for overall throughput with multiple gRPC workers.
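For context, the multi-worker setup I mean is roughly this (a sketch under my assumptions, not our production code; in OpenMP-enabled builds the variable has to be set before onnxruntime is first imported):

```python
import os

# Pin each gRPC worker process to a single OpenMP thread so that N workers
# on an N-core host do not oversubscribe the CPU. Must happen before the
# first onnxruntime import in OpenMP-enabled builds.
os.environ["OMP_NUM_THREADS"] = "1"

import onnxruntime as ort

sess = ort.InferenceSession("bert.onnx")  # placeholder model path
```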

I will test this on EC2 with intra_num_threads unspecified to confirm that the issue is Mac-only, and I'll post the results here.

fs-eire commented 4 years ago

Merged PR #4774. However, this change will not apply to the already-published packages; a local build is needed to pick up the change, or wait for the next release.

slevental commented 3 years ago

Could anyone provide some details on how OpenMP improves performance compared with the thread pools used by default? I'm curious because the Java API doesn't have OpenMP enabled in its native build. Does it make sense to use OpenMP to improve throughput, or should tuning the thread pools via the intra/inter op thread counts be enough to achieve similar numbers?

tianleiwu commented 3 years ago

@slevental, the internal thread pools achieve similar performance on most models we tested. OpenMP is slightly better on transformer models (like BERT), probably because we tuned the BERT optimizations for OpenMP.
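Without OpenMP, the tuning knobs are the intra/inter op thread counts on SessionOptions. A minimal Python sketch (I'm assuming the Java API exposes equivalent setters such as setIntraOpNumThreads; `bert.onnx` is a placeholder model):

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # threads that parallelize work inside one op
opts.inter_op_num_threads = 1  # threads that run independent ops concurrently
# inter_op_num_threads only takes effect in ORT_PARALLEL execution mode:
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

sess = ort.InferenceSession("bert.onnx", sess_options=opts)
```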

guotong1988 commented 3 years ago

What is the conclusion?

guotong1988 commented 3 years ago

> Could anyone provide some details on how OpenMP improves performance compared with the thread pools used by default? I'm curious because the Java API doesn't have OpenMP enabled in its native build. Does it make sense to use OpenMP to improve throughput, or should tuning the thread pools via the intra/inter op thread counts be enough to achieve similar numbers?

What's wrong with https://github.com/microsoft/onnxruntime/issues/6031? Thank you!