In the past, on other projects, I've noticed that latency can be higher on an idle system because of heuristics that cause worker threads to block in the OS after some period of time. Then, when you perform a request, it takes longer because the cost of waking a blocked thread is much higher than the cost of passing a work item to an already-active thread.
I'm not sure if this would explain the slowdown you are seeing here though.
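To make the effect concrete, here is a small stand-alone sketch (purely illustrative, not WinML/ORT code) that measures how long it takes to wake a worker thread that has gone to sleep in the OS on a condition variable:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
  std::chrono::steady_clock::time_point signalled;

  // The worker blocks in the OS until it is signalled, then reports how long the wake took.
  std::thread worker([&] {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return ready; });
    const auto wake_latency = std::chrono::steady_clock::now() - signalled;
    std::printf("wake latency: %lld us\n",
                static_cast<long long>(
                    std::chrono::duration_cast<std::chrono::microseconds>(wake_latency).count()));
  });

  // Let the worker sit idle for a while before handing it "work".
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  {
    std::lock_guard<std::mutex> lock(m);
    ready = true;
    signalled = std::chrono::steady_clock::now();
  }
  cv.notify_one();
  worker.join();
  return 0;
}
```

A thread that is still spinning would typically pick up the same work orders of magnitude faster, which is why thread pools often spin briefly before blocking.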
@tlh20 Could this be related to the spin-versus-sleep implementation in the threadpool? Is there a simple experiment with threadpool settings that @tomaszmi could try?
@tomaszmi In the meantime, if you could collect separate CPU profiles from the fast and slow regimes, that would probably help to diagnose the problem.
ORT provides control over whether to spin-then-block (default) or to block immediately via a session configuration option https://github.com/microsoft/onnxruntime/commit/679718b12f9f1afaae43f2f7f49d370abf1025b7. I am not sure if that is exposed via WinML currently.
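For reference, outside of WinML that option can be set through the ORT C++ API roughly like this (a minimal sketch; the config key session.intra_op.allow_spinning is the one introduced by the linked commit, and model.onnx is a placeholder path):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "spin-vs-block");
  Ort::SessionOptions options;
  // "0" asks the intra-op thread pool to block immediately when idle;
  // the default ("1") is spin-then-block.
  options.AddConfigEntry("session.intra_op.allow_spinning", "0");
  Ort::Session session(env, ORT_TSTR("model.onnx"), options);
  // ... bind inputs and call session.Run(...) as usual ...
  return 0;
}
```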
At first glance the ~50ms degradation is longer than I would expect for having an extra pass through the OS scheduler, but it would be good to investigate further.
One idea for exploring this is to take ORT's parallelism out of the picture by disabling multi-threading via https://docs.microsoft.com/en-us/windows/ai/windows-ml/native-apis/intraopnumthreads. If the effect is still seen with 1 thread then that suggests it is not related to the ORT thread pool's behavior -- there may well be spin-or-block or other heuristics in the system though.
If it is possible to collect a CPU scheduler trace that would be interesting.
Thank you for the responses and hints. I'm setting the number of intra-op threads to 1: https://github.com/tomaszmi/winml_inference_time/blob/24e052cddc5eb03be2f8ca834ea13e753a529fa4/main.cpp#L72
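For anyone else following along, the override in that line amounts to something like the following (a sketch assuming the ILearningModelSessionOptionsNative interface from the docs linked above; see the repository for the exact code):

```cpp
#include <winrt/Windows.AI.MachineLearning.h>
#include <windows.ai.machinelearning.native.h>

using winrt::Windows::AI::MachineLearning::LearningModelSessionOptions;

LearningModelSessionOptions MakeSingleThreadedSessionOptions() {
  LearningModelSessionOptions options;
  // Drop down to the native COM interface and pin the ORT intra-op pool to one thread.
  auto native = options.as<ILearningModelSessionOptionsNative>();
  winrt::check_hresult(native->SetIntraOpNumThreadsOverride(1));
  return options;
}
```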
Results: inferences on CPU take much longer now, but the trend is preserved, i.e. a bigger delay between calls makes the inference runs longer.
| nap_time (ms) | min (ms) | max (ms) | mean (ms) | stddev (ms) |
|---|---|---|---|---|
| 0 | 236 | 259 | 239.176 | 2.583481 |
| 50 | 235 | 274 | 242.128 | 7.411442 |
| 100 | 235 | 291 | 248.838 | 12.268855 |
| 150 | 237 | 283 | 256.486 | 11.041814 |
| 200 | 235 | 280 | 247.884 | 11.404326 |
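The shape of the measurement loop producing these numbers is roughly the following (a simplified sketch; Evaluate(), iterations and nap_ms are placeholders, and the actual code is in the linked repository):

```cpp
#include <chrono>
#include <thread>
#include <vector>

void Evaluate();  // placeholder for a single WinML LearningModelSession::Evaluate call

// Sleep ("nap") between inferences, then time each Evaluate() call individually.
std::vector<double> MeasureInferenceTimes(int iterations, int nap_ms) {
  std::vector<double> durations_ms;
  for (int i = 0; i < iterations; ++i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(nap_ms));
    const auto start = std::chrono::steady_clock::now();
    Evaluate();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    durations_ms.push_back(elapsed.count());
  }
  return durations_ms;
}
```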
During that run I profiled the CPU and GPU load; the results (a ~250 MB *.diagsession file) are uploaded to my Google Drive: https://drive.google.com/file/d/1IVZaXuAhaq55j5PDtv7yZE_O_oQZFnkT/view?usp=sharing
Is there anything else I could do to help solve this issue? Are you able to reproduce the behavior? Is there any solution coming soon? This issue prevents me from releasing WinML-based functionality to production.
@martinb35, @orilevari, do you have any thoughts on the issue here? The investigation shows longer latency serving requests when there is a ~100ms gap between requests rather than running them back-to-back. Experiments showed the effect remains when setting intra-op threads to 1 (suggesting this is not coming from the ORT thread pool's sleep-vs-spin policies). The effect also remains when simplifying the workload to run only on the CPU.
I also experience a similar problem. I need to tune the resource and battery usage of the application, but that is not possible if doing so causes the runtime to go up. I think it also makes the GPU usage go up, which can cause a nasty feedback loop.
It would be great to be able to use WinML, as it gives a perfect abstraction over the wide range of Windows devices, but this is kind of a blocker.
I have the same problem on Windows 10 with the Python API. With my model, if I sleep for 20 ms to 2000 ms after each inference iteration, the inference time increases from about 26 ms to about 250 ms. There is no such problem on Ubuntu 20.04.
@tomaszmi: Laptop? This is common behavior on laptops for GPU battery saving, where the video driver applies dynamic frequency scaling, running in low-power mode until enough frames have been submitted to increase performance. It occurs at the Nvidia graphics driver level, and I don't know that ORT can influence that :/. Ubuntu 20.04 uses a different graphics driver, which is why it's not seen on that OS despite running the same ORT and the same GPU. You might find some settings in the Nvidia control panel for that, but unfortunately it requires the user to configure it before running your application. e.g. this option looks promising (changing power management from adaptive to maximum/consistent performance):
Thanks! After changing that setting and restarting the machine, it works.
Closing as fixed.
Describe the bug It seems that WinML inference duration is correlated with the frequency with which it is called: the more frequently it is called, the faster it is.
Urgency none
System information
To Reproduce A dedicated repository with full source code reproducing the behavior and clear instructions (in README.md) has been prepared here: https://github.com/tomaszmi/winml_inference_time
Expected behavior Inference duration is not correlated with how often inference is called (apart from the first run).
Screenshots
WinML GPU inference durations for various sleep durations:
WinML CPU inference durations for various sleep durations:
Additional context I observe the same behavior when using another (ICNET-based) model.