giampaolo / psutil

Cross-platform lib for process and system monitoring in Python
BSD 3-Clause "New" or "Revised" License

[Windows] CPU percent is incorrect (perf counters) #2467

Open Extiward opened 3 weeks ago

Extiward commented 3 weeks ago

Summary

Description

When using cpu_percent with percpu=False to display CPU load, the value is always much lower than expected: cpu_percent returns a low, often single-digit percentage while the CPU is actually loaded to e.g. 50-70% according to Task Manager. When using percpu=True, only one element in the array contains a large number (which element it is seems to change from run to run), and that number roughly corresponds to the full CPU utilization (see the example output below). The CPU has 12 cores and 24 threads.

Code snippet:

import time
import psutil

while True:
    # interval=1 blocks for one second and measures load over that window
    cpu_load = psutil.cpu_percent(interval=1, percpu=True)
    print(f"CPU load: {cpu_load}%")
    time.sleep(1)

Example output:

CPU load: [0.0, 0.0, 1.6, 3.1, 0.0, 3.1, 0.0, 4.7, 0.0, 0.0, 0.0, 1.6, 0.0, 4.7, 1.6, 0.0, 1.6, 3.1, 3.1, 0.0, 0.0, 3.1, 1.6, 42.4]%
CPU load: [3.1, 3.1, 6.2, 1.6, 0.0, 3.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.6, 0.0, 0.0, 1.6, 0.0, 0.0, 3.1, 1.6, 41.5]%
CPU load: [0.0, 1.6, 6.2, 6.2, 0.0, 0.0, 1.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.1, 0.0, 0.0, 0.0, 0.0, 0.0, 70.1]%
CPU load: [4.6, 0.0, 3.1, 4.7, 0.0, 1.6, 1.6, 1.6, 1.6, 1.6, 4.7, 3.1, 0.0, 3.1, 10.9, 3.1, 0.0, 4.7, 3.1, 10.9, 1.6, 0.0, 3.1, 50.0]%
CPU load: [0.0, 0.0, 0.0, 6.3, 0.0, 0.0, 1.6, 3.1, 0.0, 0.0, 3.1, 0.0, 0.0, 3.1, 3.1, 1.6, 1.6, 3.1, 0.0, 3.1, 0.0, 1.6, 0.0, 35.4]%

image

That can't be correct behavior. Expected result would be to have roughly even load across all cores as seen in the attached screenshot.
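
For what it's worth, the single-digit percpu=False reading is at least consistent with the per-CPU list. If the overall figure is roughly the mean of the per-CPU figures (an assumption on my part, not something I checked against the psutil source), one hot element among 24 averages out to about 3%. A quick sketch in plain Python, using the first output line above:

```python
# Per-CPU values copied from the first "CPU load" line of the example output.
sample = [0.0, 0.0, 1.6, 3.1, 0.0, 3.1, 0.0, 4.7, 0.0, 0.0, 0.0, 1.6,
          0.0, 4.7, 1.6, 0.0, 1.6, 3.1, 3.1, 0.0, 0.0, 3.1, 1.6, 42.4]


def mean_load(per_cpu):
    """Average per-CPU percentages into one overall figure."""
    return sum(per_cpu) / len(per_cpu)


overall = mean_load(sample)
hottest = max(sample)
print(f"mean of {len(sample)} logical CPUs: {overall:.1f}%, hottest: {hottest}%")
```

So the two symptoms look like one problem: the per-CPU numbers themselves are wrong, and the overall number simply inherits that.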

dbwiddis commented 4 days ago

Internally the code uses NtQuerySystemInformation https://github.com/giampaolo/psutil/blob/7cae974b9baa669f3ce738f5cd02458cd0d8c7d9/psutil/arch/windows/cpu.c#L103-L105

Unfortunately that function's documentation says

[NtQuerySystemInformation may be altered or unavailable in future versions of Windows. Applications should use the alternate functions listed in this topic.]

Of course, the suggested alternate function is the wrong one for this purpose, since it only gives system-wide times rather than per-CPU times:

Use GetSystemTimes instead to retrieve this information.

I've seen other functions changing behavior in Windows 11.

This code should probably be switched to use performance counters ("Processor Information").

giampaolo commented 1 day ago

When using cpu_percent with percpu=False to display CPU load the value is always much lower than expected, e.g. cpu_percent returns load or single digit percent, while CPU actually is loaded to e.g. 50-70% (when looking at Task Manager). When using percpu=True only one element [...]

According to this description, both cpu_percent(percpu=False) and cpu_percent(percpu=True) return incorrect values (@Extiward am I correct?).

Note: internally cpu_percent(percpu=False) relies on GetSystemTimes. Unlike NtQuerySystemInformation, the MS docs do not officially discourage or deprecate it. They even say:

On a multiprocessor system, the values returned are the sum of the designated times across all processors.

So are we sure GetSystemTimes is at fault here? It's an old and well-established Windows API.
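
For context, a usage percentage is normally derived from GetSystemTimes by sampling it twice and comparing the deltas. The sketch below is only an illustration of that arithmetic, not psutil's actual code; read_system_times and busy_percent are hypothetical helper names. It relies on the documented fact that the kernel time returned by GetSystemTimes includes idle time.

```python
import sys


def filetime_to_int(ft):
    """Combine the two 32-bit halves of a FILETIME (100 ns units)."""
    return (ft.dwHighDateTime << 32) | ft.dwLowDateTime


def read_system_times():
    """Return (idle, kernel, user) summed across all processors. Windows only."""
    if sys.platform != "win32":
        raise OSError("GetSystemTimes is only available on Windows")
    import ctypes
    from ctypes import wintypes

    idle, kernel, user = wintypes.FILETIME(), wintypes.FILETIME(), wintypes.FILETIME()
    ok = ctypes.windll.kernel32.GetSystemTimes(
        ctypes.byref(idle), ctypes.byref(kernel), ctypes.byref(user))
    if not ok:
        raise ctypes.WinError()
    return tuple(filetime_to_int(t) for t in (idle, kernel, user))


def busy_percent(before, after):
    """Percent of non-idle time between two (idle, kernel, user) samples."""
    idle = after[0] - before[0]
    kernel = after[1] - before[1]  # kernel time includes idle time
    user = after[2] - before[2]
    total = kernel + user
    if total <= 0:
        return 0.0
    return 100.0 * (total - idle) / total
```

On Windows, take two read_system_times() samples about a second apart and pass them to busy_percent. Because the times are summed over all processors, the result is already an average across logical CPUs.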

For reference, here are the links to the psutil implementation

giampaolo commented 1 day ago

Related https://github.com/giampaolo/psutil/issues/2384#issuecomment-2011099016.

giampaolo commented 1 day ago

ChatGPT seems to confirm that GetSystemTimes is basically deprecated on modern systems:

Q: is it true that GetSystemTimes no longer returns accurate results on recent windows versions, and instead I should use performance counters

Yes, this is accurate to an extent. On recent versions of Windows, starting with Windows 8 and Windows Server 2012, the behavior of the GetSystemTimes function changed due to improvements in the way the operating system tracks CPU usage, particularly on modern hardware with dynamic clock speeds (e.g., Turbo Boost, power-saving features).

Modern CPUs adjust their clock speeds dynamically based on workload and power management policies. GetSystemTimes relies on tick-based counters, which can become inconsistent when the clock speed changes.

The precision of the timers used internally by GetSystemTimes may not account for all variations in CPU usage, especially on systems with energy-saving features enabled.

Scaling Issues: On systems with multiple cores or hyper-threading, the reported CPU times may not fully align with actual performance or workload distribution.

It's unfortunate that I have to learn this from an AI instead of from the MS docs. :-\

If this is true, it may indeed make sense to calculate system CPU times by using perf counters. I remember you Daniel (@dbwiddis) did something similar: you replaced a native Windows API with performance counters for swap_memory() in #2160. Perhaps that suggests perf counters should also be used elsewhere, not only in swap and CPU functions (sigh!).

There seems to be one problem: according to code (e.g. see here and here) some performance counters may be disabled and fail. As such, we should probably ship a dual implementation: try perf counters first else use Windows native API.

And still unsolved, since we're discussing 2 problems here: it's not clear how to replace NtQuerySystemInformation to collect per-CPU metrics.
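
To make the dual-implementation idea concrete, here's a rough sketch. Both provider functions are placeholders I made up (nothing like them exists in psutil today), and the returned numbers are dummy values. The point is only the try-then-fall-back shape.

```python
def first_working(providers):
    """Call each (name, func) in order; return (name, value) from the first that succeeds."""
    errors = []
    for name, func in providers:
        try:
            return name, func()
        except OSError as exc:  # e.g. counters disabled or corrupted
            errors.append(f"{name}: {exc}")
    raise OSError("no CPU times provider available (" + "; ".join(errors) + ")")


# Hypothetical stand-ins for the two implementations.
def cpu_times_from_perf_counters():
    raise OSError("performance counters are disabled")


def cpu_times_from_native_api():
    return {"user": 1.0, "system": 2.0, "idle": 7.0}  # dummy values


source, times = first_working([
    ("perf counters", cpu_times_from_perf_counters),
    ("native API", cpu_times_from_native_api),
])
print(source, times)
```

Collecting the individual errors means that if every provider fails, the final message can still tell the user why.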

dbwiddis commented 1 day ago

If this is true, it may indeed make sense to calculate system CPU times by using perf counters. I remember you Daniel (@dbwiddis) did something similar: you replaced a native Windows API with performance counters for swap_memory() in #2160. Perhaps that suggests perf counters should also be used elsewhere, not only in swap and CPU functions (sigh!).

Yes, that's generally what I've done over on the Java/JNA side.

There seems to be one problem: according to code (e.g. see here and here) some performance counters may be disabled and fail. As such, we should probably ship a dual implementation: try perf counters first else use Windows native API.

Having navigated through the range of associated problems over the years and implemented multiple fallbacks, yes, "it's complicated". Here are some of the obstacles:

  1. Performance counters can get corrupted, which breaks them. There are MS Docs on fixing them but I've found that an error message pointing the user to the docs for fixing them is the best option here.
  2. Performance counters are tricky with internationalization settings. In particular, if you start with a default English configuration, switch it to another language, and then switch it back to English, the English counter name data gets deleted. As with item 1, print an error message.

In both of the above cases, it may be possible to use a WMI table to fetch the counters from the same source without using the PDH functions. It can be slower (COM overhead) but typically works as a backup.

  3. Performance counters can be manually disabled by users. This is a common hack in the online gaming community, where speed matters and players both overclock and disable as many background processes as possible.

When they're disabled, you can't do anything, WMI doesn't even work as a backup. Just say so in an error message; however, allow for configuration to minimize log messages in that case. :)

  4. In some containers, special configuration is required to expose the counters to the container. I know this is true for Windows containers, not sure about others. This is (like other container issues) difficult to detect at runtime.
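
For the "say so in an error message, but don't spam the log" part, something like the following is what I have in mind. It's a sketch using Python's standard logging module; the class name, the quiet option and the message text are all made up for illustration.

```python
import logging

logger = logging.getLogger("perfcounters")


class CounterWarnings:
    """Report each kind of counter failure once; stay silent if quiet=True."""

    def __init__(self, quiet=False):
        self.quiet = quiet
        self._seen = set()

    def report(self, reason, hint):
        """Log a warning the first time `reason` is seen. Returns True if logged."""
        if self.quiet or reason in self._seen:
            return False
        self._seen.add(reason)
        logger.warning("performance counters unavailable (%s): %s", reason, hint)
        return True


reporter = CounterWarnings()
first = reporter.report("disabled", "counters were turned off on this machine")
second = reporter.report("disabled", "counters were turned off on this machine")
muted = CounterWarnings(quiet=True).report("corrupted", "see the MS docs on rebuilding counters")
```

Here only the first call logs anything. The repeat and the quiet instance both return False without touching the log.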

And still unsolved, since we're discussing 2 problems here: it's not clear how to replace NtQuerySystemInformation to collect per-CPU metrics.

That's the "Processor Information" performance counters. Here's the corresponding WMI table (it's the 'formatted' one that gives the usage metrics you'd expect; the 'raw' data gives "ticks").

Note that "Processor Information" is processor-group aware but requires Windows 7 or later. There is a similar "Processor" performance counter that can be used pre-Win7, but it is not processor-group aware.

Also note "Processor Information" can give you "real" tick counts, but then your users will complain that you don't match the Task Manager output, so you'll need a configuration option to choose whether to use the "Utility" counters rather than the "Percent" counters.
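
In code, that configuration option can be as small as a flag that selects the counter name. A sketch with a made-up function name follows. The counter paths are the English ones, so the internationalization caveat above applies, and match_task_manager is a hypothetical option name.

```python
def cpu_counter_path(instance="_Total", match_task_manager=True):
    """Build a "Processor Information" counter path for one instance.

    instance: "_Total" for the whole system, "*" for every instance,
    or "group,cpu" (for example "0,3") for one logical CPU.
    """
    counter = "% Processor Utility" if match_task_manager else "% Processor Time"
    return "\\Processor Information({})\\{}".format(instance, counter)


print(cpu_counter_path())            # system-wide, Task Manager style
print(cpu_counter_path("*", False))  # every instance, tick-based
```

Defaulting to the "Utility" flavor is a judgment call: it avoids the "doesn't match Task Manager" reports, at the cost of not being the tick-based number.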

Have fun storming the castle!

giampaolo commented 1 day ago

That's a lot to chew on. Let's see what I can do. In the meantime... thanks as always. =) The above info is very useful.