In the past, on other projects, I've noticed that latency can be higher on an idle system because of heuristics that cause worker threads to block in the OS after some period of time. Then, when you perform a request, it takes longer because the cost of waking a blocked thread is much higher than the cost of passing a work item to an already-active thread.
I'm not sure if this would explain the slowdown you are seeing here though.
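To make the effect concrete, here is a small stand-alone sketch (purely illustrative, not WinML/ORT code) that measures how long it takes to wake a worker thread that has gone to sleep in the OS on a condition variable:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
  std::chrono::steady_clock::time_point signalled;

  // The worker blocks in the OS until it is signalled, then reports how long the wake took.
  std::thread worker([&] {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return ready; });
    const auto wake_latency = std::chrono::steady_clock::now() - signalled;
    std::printf("wake latency: %lld us\n",
                static_cast<long long>(
                    std::chrono::duration_cast<std::chrono::microseconds>(wake_latency).count()));
  });

  // Let the worker sit idle for a while before handing it "work".
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  {
    std::lock_guard<std::mutex> lock(m);
    ready = true;
    signalled = std::chrono::steady_clock::now();
  }
  cv.notify_one();
  worker.join();
  return 0;
}
```

A thread that is still spinning would typically pick up the same work orders of magnitude faster, which is why thread pools often spin briefly before blocking.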
@tlh20 Could this be related to the spin-versus-sleep implementation in the threadpool? Is there a simple experiment with threadpool settings that @tomaszmi could try?
@tomaszmi In the meantime, if you could collect separate CPU profiles from the fast and slow regimes, that would probably help to diagnose the problem.
ORT provides control over whether to spin-then-block (default) or to block immediately via a session configuration option https://github.com/microsoft/onnxruntime/commit/679718b12f9f1afaae43f2f7f49d370abf1025b7. I am not sure if that is exposed via WinML currently.
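For reference, outside of WinML that option can be set through the ORT C++ API roughly like this (a minimal sketch; the config key session.intra_op.allow_spinning is the one introduced by the linked commit, and model.onnx is a placeholder path):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "spin-vs-block");
  Ort::SessionOptions options;
  // "0" asks the intra-op thread pool to block immediately when idle;
  // the default ("1") is spin-then-block.
  options.AddConfigEntry("session.intra_op.allow_spinning", "0");
  Ort::Session session(env, ORT_TSTR("model.onnx"), options);
  // ... bind inputs and call session.Run(...) as usual ...
  return 0;
}
```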
At first glance the ~50ms degradation is longer than I would expect for having an extra pass through the OS scheduler, but it would be good to investigate further.
One idea for exploring this is to take ORT's parallelism out of the picture by disabling multi-threading via https://docs.microsoft.com/en-us/windows/ai/windows-ml/native-apis/intraopnumthreads. If the effect is still seen with 1 thread then that suggests it is not related to the ORT thread pool's behavior -- there may well be spin-or-block or other heuristics in the system though.
If it is possible to collect a CPU scheduler trace that would be interesting.
Thank you for the responses and hints. I'm setting the number of intra-op threads to 1: https://github.com/tomaszmi/winml_inference_time/blob/24e052cddc5eb03be2f8ca834ea13e753a529fa4/main.cpp#L72
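For anyone else following along, the override in that line amounts to something like the following (a sketch assuming the ILearningModelSessionOptionsNative interface from the docs linked above; see the repository for the exact code):

```cpp
#include <winrt/Windows.AI.MachineLearning.h>
#include <windows.ai.machinelearning.native.h>

using winrt::Windows::AI::MachineLearning::LearningModelSessionOptions;

LearningModelSessionOptions MakeSingleThreadedSessionOptions() {
  LearningModelSessionOptions options;
  // Drop down to the native COM interface and pin the ORT intra-op pool to one thread.
  auto native = options.as<ILearningModelSessionOptionsNative>();
  winrt::check_hresult(native->SetIntraOpNumThreadsOverride(1));
  return options;
}
```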
Results: inferences on CPU take much longer now, but the trend is preserved, i.e. a bigger delay between calls makes the inference runs longer.
| nap_time (ms) | min (ms) | max (ms) | mean (ms) | stddev (ms) |
|---|---|---|---|---|
| 0 | 236 | 259 | 239.176 | 2.583481 |
| 50 | 235 | 274 | 242.128 | 7.411442 |
| 100 | 235 | 291 | 248.838 | 12.268855 |
| 150 | 237 | 283 | 256.486 | 11.041814 |
| 200 | 235 | 280 | 247.884 | 11.404326 |
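The shape of the measurement loop producing these numbers is roughly the following (a simplified sketch; Evaluate(), iterations and nap_ms are placeholders, and the actual code is in the linked repository):

```cpp
#include <chrono>
#include <thread>
#include <vector>

void Evaluate();  // placeholder for a single WinML LearningModelSession::Evaluate call

// Sleep ("nap") between inferences, then time each Evaluate() call individually.
std::vector<double> MeasureInferenceTimes(int iterations, int nap_ms) {
  std::vector<double> durations_ms;
  for (int i = 0; i < iterations; ++i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(nap_ms));
    const auto start = std::chrono::steady_clock::now();
    Evaluate();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    durations_ms.push_back(elapsed.count());
  }
  return durations_ms;
}
```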
During that run I profiled the CPU and GPU load; the results (a ~250 MB *.diagsession file) are uploaded to my Google Drive: https://drive.google.com/file/d/1IVZaXuAhaq55j5PDtv7yZE_O_oQZFnkT/view?usp=sharing
Is there anything else I could do to help solve this issue? Are you able to reproduce the behavior? Is there any solution coming soon? This issue prevents me from releasing WinML-based functionality to production.
@martinb35, @orilevari, do you have any thoughts on the issue here? The investigation shows longer latency serving requests when there is a ~100ms gap between requests rather than running them back-to-back. Experiments showed the effect remains when setting intra-op threads to 1 (suggesting this is not coming from the ORT thread pool's sleep-vs-spin policies). The effect also remains when simplifying the workload to run only on the CPU.
I also experience a similar problem. I need to tune the resource and battery usage of the application, but that is not possible if doing so causes the runtime to go up. I think it also makes the GPU usage go up, which can cause a nasty feedback loop.
It would be great to be able to use WinML, as it gives a perfect abstraction over the wide range of Windows devices, but this is kind of a blocker.
I have the same problem on Windows 10 with the Python API. With my model, if I sleep for 20 ms to 2000 ms after each inference iteration, the inference time increases from about 26 ms to about 250 ms. There is no such problem on Ubuntu 20.04.
@tomaszmi: Laptop? This is common behavior on laptops for GPU battery saving, where the video driver applies dynamic frequency scaling, running in low-power mode until enough frames have been submitted to increase performance. It occurs at the Nvidia graphics driver level, and I don't know that ORT can influence that :/. Ubuntu 20.04 uses a different graphics driver, which is why it's not seen on that OS despite running the same ORT and the same GPU. You might find some settings in the Nvidia control panel for that, but unfortunately it requires the user to configure it before running your application. e.g. this option looks promising (changing power management from adaptive to maximum/consistent performance):
Thanks! After changing that setting and restarting the machine, it works.
Closing as fixed.
Describe the bug It seems that WinML inference duration is correlated with the frequency with which it is called: the more frequently it is called, the faster it is.
Urgency none
System information
To Reproduce A dedicated repository with full source code reproducing the behavior and clear instructions (in README.md) has been prepared here: https://github.com/tomaszmi/winml_inference_time
Expected behavior Inference duration is not correlated with how often inference is called (apart from the first run).
Screenshots
WinML GPU inference durations for various sleep durations:
WinML CPU inference durations for various sleep durations:
Additional context I observe the same behavior when using another (ICNET-based) model.