microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Performance shaking #6301

Open antkillerfarm opened 3 years ago

antkillerfarm commented 3 years ago

When I run ./onnxruntime_perf_test -v -m times -r 3000 ../../model_zoo/mnist-8/model.onnx ./result.txt on a PC (CPU backend), the per-run latency fluctuates.

Here is a fragment of the test result:

mnist,0.000107583,56377344,49,1769
mnist,0.000115512,56377344,49,1770
mnist,0.000107526,56377344,49,1771
mnist,0.000107483,56377344,49,1772
mnist,0.000107655,56377344,49,1773
mnist,0.000107925,56377344,49,1774
mnist,0.000107888,56377344,49,1775
mnist,0.00010804,56377344,49,1776
mnist,0.000107483,56377344,49,1777
mnist,0.000107841,56377344,49,1778
mnist,0.000107101,56377344,49,1779
mnist,0.000107525,56377344,49,1780
mnist,0.000107198,56377344,49,1781
mnist,0.00010805,56377344,49,1782
mnist,0.000107586,56377344,49,1783
mnist,0.000107924,56377344,49,1784
mnist,0.000107997,56377344,49,1785
mnist,0.000107628,56377344,49,1786
mnist,0.000116554,56377344,49,1787
mnist,0.000108386,56377344,49,1788
mnist,0.000107384,56377344,49,1789
mnist,0.000107415,56377344,49,1790
mnist,0.000107513,56377344,49,1791
mnist,0.000113749,56377344,49,1792
mnist,0.000108216,56377344,49,1793
mnist,0.000108261,56377344,49,1794

Runs 1770, 1787, and 1792 each took about 0.000114 to 0.000117 s, while the others took about 0.000107 to 0.000108 s.
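For anyone who wants to pick out such runs automatically, here is a minimal sketch that flags runs noticeably slower than the median. It assumes only what the fragment shows: the second column is the latency in seconds and the last column is the run index. The meaning of the other two columns is not stated in this thread, so they are ignored. The helper names and the 1.05 ratio are illustrative choices, not part of onnxruntime_perf_test.

```python
import statistics

# Rows copied verbatim from the PC fragment above.
PC_FRAGMENT = """\
mnist,0.000107583,56377344,49,1769
mnist,0.000115512,56377344,49,1770
mnist,0.000107526,56377344,49,1771
mnist,0.000107483,56377344,49,1772
mnist,0.000107655,56377344,49,1773
mnist,0.000107925,56377344,49,1774
mnist,0.000107888,56377344,49,1775
mnist,0.00010804,56377344,49,1776
mnist,0.000107483,56377344,49,1777
mnist,0.000107841,56377344,49,1778
mnist,0.000107101,56377344,49,1779
mnist,0.000107525,56377344,49,1780
mnist,0.000107198,56377344,49,1781
mnist,0.00010805,56377344,49,1782
mnist,0.000107586,56377344,49,1783
mnist,0.000107924,56377344,49,1784
mnist,0.000107997,56377344,49,1785
mnist,0.000107628,56377344,49,1786
mnist,0.000116554,56377344,49,1787
mnist,0.000108386,56377344,49,1788
mnist,0.000107384,56377344,49,1789
mnist,0.000107415,56377344,49,1790
mnist,0.000107513,56377344,49,1791
mnist,0.000113749,56377344,49,1792
mnist,0.000108216,56377344,49,1793
mnist,0.000108261,56377344,49,1794
"""


def parse_rows(text):
    """Return (run_index, latency_seconds) pairs from result lines."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split(",")
        # Assumption: column 2 is latency in seconds, last column is the run index.
        rows.append((int(fields[-1]), float(fields[1])))
    return rows


def find_outliers(rows, ratio=1.05):
    """Return the rows whose latency exceeds ratio * median latency."""
    median = statistics.median(latency for _, latency in rows)
    return [(run, latency) for run, latency in rows if latency > ratio * median]


outliers = find_outliers(parse_rows(PC_FRAGMENT))
for run, latency in outliers:
    print(run, latency)
```

With the 5% threshold this flags runs 1770, 1787, and 1792, matching the runs called out above. The same function can be pointed at the full result.txt to see how often the slow runs occur.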

In fact, this jitter is even more pronounced on an embedded device, which has a weaker CPU and less memory than the PC.

Here is a fragment of the test result on an NXP IMX865 board:

mnist-8,0.00272813,56537088,0,25
mnist-8,0.0027195,56537088,0,26
mnist-8,0.345989,56537088,0,27
mnist-8,0.002778,56537088,0,28
mnist-8,0.00272412,56537088,0,29
mnist-8,0.00272725,56537088,0,30
mnist-8,0.00272125,56537088,0,31
mnist-8,0.00271712,56537088,0,32
mnist-8,0.345096,56537088,0,33
mnist-8,0.00281438,56537088,0,34
mnist-8,0.0027335,56537088,0,35
mnist-8,0.00272513,56537088,0,36
mnist-8,0.00271863,56537088,0,37
mnist-8,0.00272275,56537088,0,38
mnist-8,0.00272738,56537088,0,39
mnist-8,0.344997,56537088,0,40
mnist-8,0.00278625,56537088,0,41
mnist-8,0.00273625,56537088,0,42

Runs 27, 33, and 40 each took about 0.345 s, while the others took only about 0.0027 s. That is a slowdown of more than 120x!
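To put numbers on this, the sketch below computes the slowdown and the spacing between the slow runs from the board fragment. The 0.1 s cut-off is an illustrative value picked to sit between the two latency clusters; it is not something the tool reports.

```python
import statistics

# (run_index, latency_seconds) pairs copied from the board fragment above.
BOARD_RUNS = [
    (25, 0.00272813), (26, 0.0027195), (27, 0.345989), (28, 0.002778),
    (29, 0.00272412), (30, 0.00272725), (31, 0.00272125), (32, 0.00271712),
    (33, 0.345096), (34, 0.00281438), (35, 0.0027335), (36, 0.00272513),
    (37, 0.00271863), (38, 0.00272275), (39, 0.00272738), (40, 0.344997),
    (41, 0.00278625), (42, 0.00273625),
]

# Illustrative cut-off separating normal runs from spikes (an assumption).
SPIKE_THRESHOLD_S = 0.1

normal = [latency for _, latency in BOARD_RUNS if latency < SPIKE_THRESHOLD_S]
spikes = [(run, latency) for run, latency in BOARD_RUNS if latency >= SPIKE_THRESHOLD_S]

typical = statistics.median(normal)
slowdown = statistics.mean(latency for _, latency in spikes) / typical

# Number of runs from one spike to the next.
gaps = [later - earlier for (earlier, _), (later, _) in zip(spikes, spikes[1:])]

print(f"typical latency: {typical:.5f} s")
print(f"slowdown on spikes: {slowdown:.0f}x")
print(f"runs between spikes: {gaps}")
```

On this fragment the slow runs come out at roughly 127x the typical latency and recur every 6 to 7 runs. A fragment this short cannot show whether that spacing holds across the whole result file.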

tlh20 commented 3 years ago

I am not very familiar with the NXP systems, but does the IMX865 use an Arm big.LITTLE processor? If so, it could be that the different timings reflect different placements of the threads by the OS. If running on Linux and testing with onnxruntime_perf_test, could you try two experiments:

  1. Set the number of threads used in the ORT thread pool to match the number of big cores, and run onnxruntime_perf_test via numactl with the --physcpubind option to constrain the threads to the big cores (via the numbers from the processor fields of /proc/cpuinfo).

  2. Repeat the run with the number of threads set to match the number of small cores, and with the --physcpubind option switched to the small core IDs.
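The two experiments above could be scripted roughly as follows. This is a sketch under several assumptions: that /proc/cpuinfo on the board reports a "CPU part" field per processor (as Arm Linux kernels typically do), that cores with the same part value belong to the same cluster, and that the perf tool's -x option sets the intra-op thread count (please check the tool's help output for your build). The sample cpuinfo text and part values are hypothetical, not taken from the IMX865, and the helper names are made up for illustration.

```python
import shlex
from collections import defaultdict

# Hypothetical /proc/cpuinfo excerpt; on a real board, read the file instead.
SAMPLE_CPUINFO = """\
processor\t: 0
CPU part\t: 0xd03

processor\t: 1
CPU part\t: 0xd03

processor\t: 2
CPU part\t: 0xd03

processor\t: 3
CPU part\t: 0xd03

processor\t: 4
CPU part\t: 0xd08

processor\t: 5
CPU part\t: 0xd08
"""


def group_cpus_by_part(cpuinfo_text):
    """Map each 'CPU part' value to the processor IDs that report it."""
    groups = defaultdict(list)
    current = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "CPU part" and current is not None:
            groups[value].append(current)
    return dict(groups)


def build_pinned_command(cpu_ids, perf_test_args):
    """Build a numactl command that pins the perf test to the given cores."""
    cpu_list = ",".join(str(cpu) for cpu in cpu_ids)
    return [
        "numactl", f"--physcpubind={cpu_list}",
        "./onnxruntime_perf_test",
        # Assumption: -x sets the intra-op thread count; one thread per pinned core.
        "-x", str(len(cpu_ids)),
        *perf_test_args,
    ]


# Arguments taken from the command in the original report.
PERF_ARGS = ["-v", "-m", "times", "-r", "3000",
             "../../model_zoo/mnist-8/model.onnx", "./result.txt"]

clusters = group_cpus_by_part(SAMPLE_CPUINFO)
for part, cpu_ids in clusters.items():
    print(part, shlex.join(build_pinned_command(cpu_ids, PERF_ARGS)))
```

Running the printed command once per cluster and comparing the two result files would show whether the slow runs follow a particular set of cores. Which part value corresponds to the big cores has to be looked up for the specific SoC.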