giampaolo / psutil

Cross-platform lib for process and system monitoring in Python
BSD 3-Clause "New" or "Revised" License

[Linux] Provide access to Pressure Stall Information metrics #1932

Open MrPippin66 opened 3 years ago

MrPippin66 commented 3 years ago

OS: Linux (kernel 4.20 or higher, unless the vendor has backported the feature)

Type: Performance metrics for CPU, Memory & IO

Summary:

https://www.kernel.org/doc/html/latest/accounting/psi.html

Though this is a relatively new feature, it is likely to become a common source of information for diagnosing performance issues on systems.

I would request that this information be made available (if PSI is enabled in the OS, system-wide and/or per cgroup) via the psutil framework, primarily so that tools built atop this framework can readily use it for monitoring purposes.

Ultimately, I'd suggest a new category (psi) to gather these values from.

psutil.psi_cpu()

psutil.psi_memory()

psutil.psi_io()

I'd request both the system level and cgroup2 level data be presented for each category.
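To make the request concrete, here is a minimal sketch of how the two data sources could be read. The names `parse_psi` and `read_psi` are hypothetical and not part of psutil. The sketch assumes the file layout described in the kernel documentation linked above: system-level data in `/proc/pressure/{cpu,memory,io}`, and per-cgroup data in `cpu.pressure`, `memory.pressure` and `io.pressure` inside a cgroup2 directory, each made up of `some` / `full` lines with `avg10`, `avg60`, `avg300` and `total` fields.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the text of a PSI file into {"some": {...}, "full": {...}}.

    Only the line kinds actually present in the text are returned
    (a file may lack one of the two lines).
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # "total" is an integer counter; the avg* fields are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_psi(resource, cgroup_dir=None):
    """Read PSI data for "cpu", "memory" or "io".

    Hypothetical helper: with cgroup_dir=None it reads the system-level
    file, otherwise the <resource>.pressure file in that cgroup2 directory.
    """
    if cgroup_dir is None:
        path = Path("/proc/pressure") / resource
    else:
        path = Path(cgroup_dir) / f"{resource}.pressure"
    return parse_psi(path.read_text())
```

On a system with PSI enabled, `read_psi("memory")` would return the system-wide numbers, and `read_psi("memory", "/sys/fs/cgroup/some.slice")` the numbers for that cgroup (the cgroup path here is only an example).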

giampaolo commented 3 years ago

Mmm... It's the first time I've heard about this. I'm trying to understand how it works (https://unixism.net/2019/08/linux-pressure-stall-information-psi-by-example/). The information in those 3 files is easy to extract. What's more difficult is understanding how to interpret that data and imagining an actual use case. For instance, the psutil doc shows an actual use case for psutil.getloadavg(), showing how to translate those raw numbers into a percentage of CPU usage/load over time:

>>> import psutil
>>> psutil.getloadavg()
(3.14, 3.89, 4.67)
>>> psutil.cpu_count()
10
>>> # percentage representation over the last 1, 5, 15 mins
>>> [x / psutil.cpu_count() * 100 for x in psutil.getloadavg()]
[31.4, 38.9, 46.7]

If we were to add this I would like to see something similar to provide in the doc: some actual code which does something useful with those raw numbers extracted from /proc/pressure. But in order to do that I/we'd have to properly understand how this works first. =)
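As one possible starting point for such a doc example: according to the kernel documentation, the `avg10` / `avg60` / `avg300` fields are already percentages, while `total` is the cumulative stall time in microseconds. Sampling `total` twice therefore gives the share of wall-clock time spent stalled over any window you choose. The sketch below assumes that format; `stall_percent`, `read_total_us` and `measure` are hypothetical names, not psutil APIs.

```python
import time


def stall_percent(total_before_us, total_after_us, elapsed_seconds):
    """Percentage of the elapsed wall-clock time spent stalled.

    The two totals are cumulative stall times in microseconds, as found
    in the "total=" field of a PSI line.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    stalled_us = total_after_us - total_before_us
    return stalled_us / (elapsed_seconds * 1_000_000) * 100


def read_total_us(path, kind="some"):
    """Return the "total=" value of the "some" or "full" line in a PSI file."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == kind:
                for pair in fields[1:]:
                    key, _, value = pair.partition("=")
                    if key == "total":
                        return int(value)
    raise ValueError(f"no {kind!r} line with a total field in {path}")


def measure(path, interval=1.0, kind="some"):
    """Sample a PSI file twice and return the stall percentage in between."""
    start = time.monotonic()
    before = read_total_us(path, kind)
    time.sleep(interval)
    after = read_total_us(path, kind)
    return stall_percent(before, after, time.monotonic() - start)
```

For example, `measure("/proc/pressure/io", interval=5)` would report the percentage of the last 5 seconds during which at least one task was stalled waiting on I/O, which is a number a monitoring tool could alert on directly.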

MrPippin66 commented 3 years ago

PSI was developed by Facebook. They posted a decent explanation of how they use it, and the benefits it's given them.

https://lwn.net/Articles/759658/

And FYI, that article gives a detailed account of the problems with "getloadavg", going beyond the ones we've encountered (namely, that you can have several threads that are each active for only a small part of their allocated time slice; they show up as a high load average even though overall CPU utilization is low).

Swap thrashing also isn't the only memory condition that results in low processor throughput. This facility would cover the others as well (high reclaim rates, high fault rates, etc.) without the need for complicated monitoring scripts.

I think having this available would simplify monitoring, and hopefully it would be picked up by monitoring products, like "ncpa", etc.

MrPippin66 commented 1 year ago

@giampaolo Is this still a feature candidate?

MrPippin66 commented 1 year ago

FYI, PSI data is reported in 'sar' data for all current Linux distributions. I think being able to report this data in 'psutil' merits attention.