Open antkillerfarm opened 3 years ago
I am not very familiar with the NXP systems, but does the IMX865 use an Arm big.LITTLE processor? If so, it could be that the different timings reflect different placements of the threads by the OS. If running on Linux and testing with onnxruntime_perf_test, could you try two experiments:
Set the number of threads used in the ORT thread pool to match the number of big cores, and run onnxruntime_perf_test via numactl with the --physcpubind option to constrain the threads to the big cores (via the numbers from the processor fields of /proc/cpuinfo).
Repeat with the number of threads set to match the number of small cores, and with the --physcpubind option switched to use the small core IDs.
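The two experiments above could be scripted roughly as follows. This is only a sketch under several assumptions: that /proc/cpuinfo on the board reports a "CPU part" field per processor (typical for Arm Linux kernels, absent on x86), that cores sharing a part value belong to the same cluster, and that `-x` is the perf tool's intra-op thread count flag (worth confirming against its help output). The sample text and the helper names `group_cores_by_part` and `pinned_command` are hypothetical.

```python
from collections import defaultdict


def group_cores_by_part(cpuinfo_text):
    """Map each 'CPU part' value to the processor IDs that report it."""
    groups = defaultdict(list)
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "CPU part" and current is not None:
            groups[value].append(current)
    return dict(groups)


def pinned_command(core_ids, perf_args):
    """Build a numactl command that pins the perf test to core_ids.

    Assumption: '-x' sets the intra-op thread count of
    onnxruntime_perf_test; check the tool's usage text before relying on it.
    """
    cpu_list = ",".join(str(c) for c in core_ids)
    return [
        "numactl", f"--physcpubind={cpu_list}",
        "./onnxruntime_perf_test", "-x", str(len(core_ids)),
        *perf_args,
    ]


# Hypothetical sample; on the board, use open("/proc/cpuinfo").read() instead.
SAMPLE_CPUINFO = "\n".join(
    f"processor\t: {i}\nCPU part\t: {part}\n"
    for i, part in enumerate(["0xd03"] * 4 + ["0xd08"] * 2)
)

if __name__ == "__main__":
    args = ["-m", "times", "-r", "3000",
            "../../model_zoo/mnist-8/model.onnx", "./result.txt"]
    for part, cores in group_cores_by_part(SAMPLE_CPUINFO).items():
        print(part, " ".join(pinned_command(cores, args)))
```

Which part value corresponds to the big cluster has to be looked up for the actual SoC; the script only groups the core IDs so that each group can be passed to numactl in turn.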
When I run:
./onnxruntime_perf_test -v -m times -r 3000 ../../model_zoo/mnist-8/model.onnx ./result.txt
on a PC (CPU backend), I see the following (fragment of the test result):
The 1770th, 1787th, and 1792nd runs cost ~0.000115s, while the others cost ~0.000107s.
In fact, this jitter is even more obvious on an embedded device, which has a weaker CPU and memory than a PC.
Here is the test result fragment on an NXP IMX865 board:
The 27th, 33rd, and 40th runs cost ~0.345s, while the others cost only ~0.0027s. That is a performance drop of more than 120x!
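To quantify spikes like these instead of reading them off the log by eye, the per-run times can be compared against the median. The sketch below is illustrative only: `find_spikes` is a hypothetical helper, the threshold factor of 10 is an arbitrary choice, and the latency list is synthetic, built from the two figures quoted above rather than from the actual result file.

```python
import statistics


def find_spikes(latencies, factor=10.0):
    """Return (run_number, seconds, ratio_to_median) for runs slower than
    factor * median. Run numbers are 1-based, matching the log."""
    median = statistics.median(latencies)
    return [
        (run, t, t / median)
        for run, t in enumerate(latencies, start=1)
        if t > factor * median
    ]


# Synthetic data: 50 runs at ~0.0027s with three slow runs at ~0.345s.
latencies = [0.0027] * 50
for run in (27, 33, 40):
    latencies[run - 1] = 0.345

if __name__ == "__main__":
    for run, seconds, ratio in find_spikes(latencies):
        print(f"run {run}: {seconds:.4f}s ({ratio:.0f}x the median)")
```

Using the median rather than the mean keeps the baseline from being pulled up by the slow runs themselves.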