giampaolo / psutil

Cross-platform lib for process and system monitoring in Python
BSD 3-Clause "New" or "Revised" License

psutil.cpu_count: Add argument(s) to allow differentiating "performance cores" from "efficiency cores" #2034

Open ghost opened 2 years ago

ghost commented 2 years ago

Summary

Description

Many applications need to spawn a number of worker threads or processes, where the number of physical cores is the ideal number of workers to create. The existing psutil.cpu_count(logical=False) has served us well in that regard. However, changes in modern hardware are making that API inadequate.

For example, ARM big.LITTLE (including Apple M1), Intel Alder Lake (and rumor has it future AMD CPUs) will feature a mix of "performance cores" and "efficiency cores", some of which may have SMT, and others not, even within the same CPU.

AFAIK, the use case for cpu_count is almost always going to be the performance core count, except in the odd corner case where the performance and efficiency cores can be used at the same time and the performance delta does not matter because the individual jobs are small and there are many of them.

My proposal is to add something like this:

from enum import Enum

class CPUCoreType(Enum):
    ALL = 0
    PERFORMANCE = 1
    EFFICIENCY = 2

def cpu_count(logical: bool = True, core_type: CPUCoreType = CPUCoreType.ALL):
    ...

The use of an enum should keep this feature future-proof as CPU core types become more exotic and diverse than they are today.

Things to consider:

ghost commented 2 years ago

I saw your comment in email, but I don't see it on the issue. I'm not sure if you deleted it, or if GitHub is not updating its website correctly.

Regarding the comment you shared vs. my proposal: the problem with your original comment is that logical=True|False would still be a desirable feature for this use case, although adding a kind="..." argument is a valid alternative to using an Enum.

Example article:

Core i9-12900K / KF    8P + 8E | 16 Cores / 24 threads

8 "performance cores" with SMT for a total of 16 logical "performance cores". 8 "efficiency cores" (physical and logical) without SMT. But it is reported as a 16 core / 24 logical core part, which is usually not helpful for creating worker threads and processes. I would definitely want a logical=True|False argument to differentiate the types of cores and whether they have SMT. But maybe logical=True|False would have no effect on kind="socket"|"numa", for example.
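To make the arithmetic concrete, here is a minimal sketch of how the proposed signature could behave for this part. The `Core` record and the hard-coded i9-12900K layout are assumptions made up for illustration, and the function takes the core list as an explicit argument; nothing here is an existing psutil API.

```python
from dataclasses import dataclass
from enum import Enum


class CPUCoreType(Enum):
    ALL = 0
    PERFORMANCE = 1
    EFFICIENCY = 2


@dataclass(frozen=True)
class Core:
    """Hypothetical description of one physical core."""
    core_type: CPUCoreType
    threads: int  # logical CPUs on this core (2 with SMT, 1 without)


# Assumed layout of a Core i9-12900K: 8 P-cores with SMT, 8 E-cores without.
I9_12900K = [Core(CPUCoreType.PERFORMANCE, 2)] * 8 + [Core(CPUCoreType.EFFICIENCY, 1)] * 8


def cpu_count(cores, logical=True, core_type=CPUCoreType.ALL):
    """Count logical or physical cores of the requested type."""
    selected = [c for c in cores if core_type in (CPUCoreType.ALL, c.core_type)]
    if logical:
        return sum(c.threads for c in selected)
    return len(selected)
```

With this layout, `cpu_count(I9_12900K, logical=False, core_type=CPUCoreType.PERFORMANCE)` gives 8, which is the number a CPU-bound worker pool would most likely want, while the default call gives the advertised 24.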

More references: https://www.pcmag.com/news/intels-alder-lake-combines-performance-and-efficiency-cpu-cores-on-one http://meseec.ce.rit.edu/551-projects/spring2017/1-3.pdf

giampaolo commented 2 years ago

Mmm... I'm not sure I fully understand how this would work in practice. If you're interested in "performance" vs. "efficiency" I guess you're supposed to know which CPUs (IDs) are "performant" vs. "efficient", and then tell the OS to assign a certain process to run on those CPUs. E.g., in hypothetical code:

>>> psutil.performant_cpu_ids()
[0, 2, 4]
>>> psutil.Process().cpu_affinity([0, 2, 4])  # set

If instead you only know the total number of those CPUs, what can you do with that info alone?

My proposal is to add something like this:

class CPUCoreType(Enum):
    ALL = 0
    PERFORMANCE = 1
    EFFICIENCY = 2

def cpu_count(logical: bool=True, core_type: CPUCoreType=CPUCoreType.ALL):

There's a topic about changing cpu_count() signature to extend existing use cases: https://github.com/giampaolo/psutil/issues/1392#issuecomment-710745882. If we adopt that signature, this new API would look like this:

>>> psutil.cpu_count("performance")
4
>>> psutil.cpu_count("efficiency")
4
>>>

I saw your comment in email, but I don't see it on the issue. I'm not sure if you deleted it, or if GitHub is not updating its website correctly.

Sorry, I deleted it because I hit "submit" too soon.

ghost commented 2 years ago

Given what security engineers have done in recent years to all operating systems (especially Linux), setting thread affinity is a questionable proposition. setuid and setcap binaries are treated with great suspicion. I've found it is extremely problematic to set scheduler parameters or thread affinity from an application intended to be invoked by users, especially if it is run from an interpreter like Python (yet compiled daemons running as root get far less scrutiny).

However, with these new "hybrid" CPUs, it's pretty much a given that your CPU-intensive workload will get migrated to the performance cores (that's the entire point of a hybrid CPU). If a workload isn't CPU intensive, then it doesn't really matter if it gets migrated or not. I'm not sure if the special kernel scheduling hacks around these CPUs will even respect thread affinity or not. But aligning the number of worker threads with the number of physical "performance" cores is still of critical importance.

For the implementation, unless there are good OS-specific APIs around this, probably the best thing to do is maintain a list of CPU core families or code names and the associated core metadata such as "performance" vs. "efficiency". This list would be relatively small, given that Intel, AMD, ARM, Apple and IBM/Power only release new core families once a year at most. The maintenance burden should be relatively low.

giampaolo commented 2 years ago

However, with these new "hybrid" CPUs, it's pretty much a given that your CPU-intensive workload will get migrated to the performance cores

How do you tell the OS to use those cores though? To my knowledge that's sched_setaffinity (which is what psutil uses).

ghost commented 2 years ago

How do you tell the OS to use those cores though? To my knowledge that's sched_setaffinity (which is what psutil uses).

It is safe to assume that if your worker threads are CPU-heavy, the OS kernel will automatically migrate them to the performance cores if they were on the efficiency cores. Or, if CPU usage goes down later, the threads could be migrated back to efficiency cores.

The CPU vendors contribute special scheduling code for hybrid CPUs to the various OS kernels. If these hybrid CPUs were being treated like normal CPUs, they would have wildly inconsistent performance. If thread affinity has been set, I'm not sure what would happen (user-set affinity could be ignored, or maybe not); the behavior would be implementation-specific per OS.

dbwiddis commented 2 years ago
  • Are there reliable cross platform ways to determine this information? Or will it require maintaining a database, or asking the OS vendors to provide an API to query this information?

FYI, I just implemented a feature similar to this in Java (thus cross-platform). Some helpful notes if this gets implemented:

giampaolo commented 2 years ago

Hello Daniel, thanks for providing such details.

Windows (10+) exposes an efficiencyClass member of the PROCESSOR_RELATIONSHIP field (only for ProcessorCore) that gives a relative efficiency measure, e.g., 1 = more performance, 0 = more efficient.

Do you provide that on a per-cpu basis? In that case, it seems to me this belongs more to a cpu_info(percpu=True) or cpu_topology() API of some sort, which could provide multiple info about each CPU. I did something similar already, even though NOT on a per-cpu basis, in https://github.com/giampaolo/psutil/issues/1894 On Linux:

>>> psutil.cpu_info()
{'arch': 'x86_64',
 'byteorder': 'little',
 'flags': 'fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat '
          'pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx '
          'pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good '
          'nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 '
          'monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid '
          'sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx '
          'f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb '
          'invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi '
          'flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep '
          'bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt '
          'xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp '
          'hwp_notify hwp_act_window hwp_epp md_clear flush_l1d',
 'l1d_cache': 32768,
 'l1i_cache': 32768,
 'l2_cache': 262144,
 'l3_cache': 6291456,
 'model': 'Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz',
 'vendor': 'GenuineIntel'}

Perhaps we can add a mode="powersave" / "performance" field?

Linux sysfs has /sys/devices/system/cpu/cpuX/cpu_capacity which is a performance measure in DMIPS/MHz. (Higher = P, lower = E). Currently for ARM, but following LKML this seems to be the future for Intel chips as well.

I wonder how this relates to /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor (see link). Are they the same thing? According to the link above, on Linux we could implement something like this:

>>> psutil.cpu_mode()
'performance'
>>> psutil.cpu_mode(percpu=True)
['performance', 'performance', 'performance', 'performance']

...which could also be used for setting (actually this would be extremely cool):

>>> psutil.cpu_mode("powersave")
>>> psutil.cpu_mode(['performance', 'performance', 'powersave', 'powersave'])

Do you know if PROCESSOR_RELATIONSHIP -> efficiencyClass on Windows can be used to achieve this?
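For the Linux side of the cpu_mode() idea, a read-only sketch might look like the following. It assumes the scaling_governor path mentioned above; the function name, the `sysfs_root` parameter (there so the code can be pointed at a fake tree), and the choice to return None when CPUs disagree are all made up for illustration. Note that it reports the frequency governor only, which is not necessarily the same thing as the core type.

```python
from pathlib import Path


def cpu_mode(percpu=False, sysfs_root="/sys/devices/system/cpu"):
    """Hypothetical getter: return the cpufreq scaling governor(s) on Linux."""
    def cpu_index(path):
        return int(path.name[3:])  # "cpu12" -> 12

    # Keep only cpuN directories and order them numerically, not lexically.
    cpu_dirs = sorted(
        (p for p in Path(sysfs_root).glob("cpu[0-9]*") if p.name[3:].isdigit()),
        key=cpu_index,
    )
    modes = []
    for cpu_dir in cpu_dirs:
        governor = cpu_dir / "cpufreq" / "scaling_governor"
        if governor.is_file():  # absent when cpufreq is not available
            modes.append(governor.read_text().strip())
    if percpu:
        return modes
    # Without percpu, report a single value only when every CPU agrees.
    unique = set(modes)
    return unique.pop() if len(unique) == 1 else None
```

Setting the governor would mean writing to the same files, which normally requires root, so this sketch leaves the setter out.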

dbwiddis commented 2 years ago

Do you provide that on a per-cpu basis?

Depends on what you mean by "cpu" here. :)

For my purposes, I created a new PhysicalProcessor object (representing a core). So, consider the top-of-the-line Alder Lake i9-12900K. It has:

Assuming HT is on, I have two separate enumerations: a LogicalProcessor list which would include all 24, and include topology information aligning them to the 16 physical cores, and separately a PhysicalProcessor list containing those 16 cores only.

In the Windows enumeration you would have (among other output) 16 PROCESSOR_RELATIONSHIP structures with RelationProcessorCore in the parent structure; 8 of those (efficiency) cores would have EfficiencyClass=0 and Flags=0 meaning no SMT, with single-bit GROUP_MASK, while the other 8 (performance) cores would have EfficiencyClass=1 and Flags=1 (LTP_PC_SMT) and two bits set in GROUP_MASK.
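As a rough illustration of that counting, here is a platform-independent model in Python. `CoreRecord` is a made-up stand-in for one RelationProcessorCore entry (it does not call the Windows API), and the bit positions in the sample masks are assumptions; only the number of set bits matters.

```python
from dataclasses import dataclass

LTP_PC_SMT = 0x1  # Flags value indicating the core has SMT


@dataclass(frozen=True)
class CoreRecord:
    """Hypothetical stand-in for one RelationProcessorCore entry."""
    efficiency_class: int
    flags: int
    group_mask: int  # one bit per logical processor on this core


def summarize(records):
    """Return {kind: (physical_cores, logical_cpus)} for P and E cores."""
    top = max(r.efficiency_class for r in records)
    summary = {"performance": [0, 0], "efficiency": [0, 0]}
    for r in records:
        # The highest efficiency class present is treated as "performance".
        kind = "performance" if r.efficiency_class == top else "efficiency"
        summary[kind][0] += 1
        summary[kind][1] += bin(r.group_mask).count("1")
    return {kind: tuple(counts) for kind, counts in summary.items()}


# Assumed i9-12900K layout: 8 P-cores with two mask bits, 8 E-cores with one.
ALDER_LAKE = (
    [CoreRecord(1, LTP_PC_SMT, 0b11 << (2 * i)) for i in range(8)]
    + [CoreRecord(0, 0, 1 << (16 + i)) for i in range(8)]
)
```

On a non-hybrid CPU every record has the same class, so this sketch would report all cores as "performance", which seems like the least surprising default.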

For Linux cpu_capacity, however, it appears the sysfs entries are per logical CPU. I can only assume that in a hyperthreading scenario you'd have duplicate information, but at least you'd have a core_id to correlate it with the physical cores.

For everything else, you're stuck enumerating textual output matching "CPU X" (presumably also logical processors) with a textual description of the processor, e.g., this dmesg. For now, all the ARM big.LITTLE chips are a known set of Cortex-A7x (P-) and Cortex-A5x (E-) names; and for Apple M1 we know they're all firestorm and icestorm. For now. I haven't yet seen a dmesg output on an Alder Lake chip; it would be nice if I had an unlimited budget to buy one just to run the command. :-)
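A minimal sketch of that correlation, assuming the standard sysfs layout (cpuN/cpu_capacity plus cpuN/topology/core_id and physical_package_id). The function name and the `sysfs_root` parameter are hypothetical, and keying on (package, core) is an assumption made here so that core ids repeated across packages do not collide.

```python
from pathlib import Path


def read_core_capacities(sysfs_root="/sys/devices/system/cpu"):
    """Map (package_id, core_id) -> cpu_capacity, one entry per physical core."""
    capacities = {}
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        topology = cpu_dir / "topology"
        try:
            capacity = int((cpu_dir / "cpu_capacity").read_text())
            key = (
                int((topology / "physical_package_id").read_text()),
                int((topology / "core_id").read_text()),
            )
        except (OSError, ValueError):
            continue  # file missing or unreadable on this kernel/arch
        capacities[key] = capacity  # SMT siblings collapse onto the same key
    return capacities
```

On systems where cpu_capacity is not exposed, the result is simply an empty dict, so a caller would need some other fallback.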

Perhaps we can add a mode="powersave" / "performance" field?

I wouldn't use "powersave" here; industry branding appears to align with "performance" and "efficiency" (or P-core and E-core). Also while current hybrid chips only have two types, in theory in the future we could have some mid-sized cores as well. Windows' choice of a "relative efficiency measure" aligns with this potential.
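One way to stay open to more than two core types is to treat whatever number the OS reports as a relative measure and only rank it. This is just a sketch; the function name and the three-tier sample values in the usage note are invented.

```python
def rank_core_classes(values):
    """Map each per-core efficiency/capacity value to a rank, 0 = lowest."""
    ordered = sorted(set(values))
    rank_of = {value: rank for rank, value in enumerate(ordered)}
    return [rank_of[v] for v in values]
```

For a two-type chip reporting `[0, 0, 1, 1]` the ranks are unchanged, while a hypothetical three-tier chip reporting `[300, 600, 1024, 600]` would come out as `[0, 1, 2, 1]` without any change to the API.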

I wonder how this relates to /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

I don't think they are the same at all. You can (and often do) change frequency on one or more processors of the same type for performance considerations. The "capacity" represents a maximum performance level without scaling; and I don't think it's user-adjustable.

Do you know if PROCESSOR_RELATIONSHIP -> efficiencyClass on Windows can be used to achieve this?

No, it's just an output identifying the type of chip, basically read-only.

dbwiddis commented 2 years ago

Just took some time to catch up on the entire thread (which I skimmed earlier) and wanted to highlight a few points:

The original request was just for the "number of performance cores" for the purpose of limiting workloads.

In your Windows implementation you are already iterating over an array of SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX structures, but you are only counting them up (ncpus += 1). This would result in the 16 in the Alder Lake example above.

Regarding #1392 (comment), and related comments about topology, you need a bit of both. In my project:

On most OSes, the combination of package, core, and chip id is necessary for full topology, with NUMA nodes + logical processors as a separate thing. On Windows there's a numbered logical topology with meaning in the OS (NUMA + logical) and a physical topology (package + core) without any numbering.

You could, in theory, keep the efficiency value at the lowest level (that's how you're going to collect it on every OS except Windows) so you could have "performance logical processors" and "efficiency logical processors" (although I'd bet one non-HT efficiency LP could process one task faster than two HT efficiency LP could process two similar tasks). So given the proposed API:

def cpu_count(logical: bool=True, core_type: CPUCoreType=CPUCoreType.ALL):

Then:

dbwiddis commented 2 years ago

FYI, the GCC Compile farm just made an M1 (4 performance+4 efficiency cores) Linux machine available, so I was able to test out my own API. Here's the output for my processor information implementation. You can see package 0 cores 0,1,2,3 are "efficiency" (lower #, 459) and package 1 cores 0,1,2,3 are "performance" (higher #, 1024):

 2 physical CPU package(s)
 8 physical CPU core(s) (4 performance + 4 efficiency)
 8 logical CPU(s)
Identifier: aarch64 Family 8 Model 0x023 Stepping r0x1p1
ProcessorID: 6118023100000000
Microarchitecture: unknown
 Cores:
  0,0: efficiency=459, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028
  0,1: efficiency=459, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028
  0,2: efficiency=459, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028
  0,3: efficiency=459, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028
  1,0: efficiency=1024, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028
  1,1: efficiency=1024, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028
  1,2: efficiency=1024, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028
  1,3: efficiency=1024, id=cpu:type:aarch64:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0010,0011,0014,0015,0017,0018,0019,001A,001B,001C,001D,001E,001F,0020,0027,0028

HunterAP23 commented 2 years ago

Bumping this as I too am interested in having this functionality added. Pardon my typing as I am writing this out on mobile.

As for the matter of what arguments should be used for the cpu_count function, this is what I'd request to have if it does get implemented:

  1. Keep the logical argument
  2. Add an argument for core kind. I liked the earlier suggestion of using enums for the types, as we don't know if future versions will add more core types (maybe have performance, efficiency, and background level cores?).
  3. Both of those arguments should be usable simultaneously, so you can get any combination of logical/physical and performance/efficiency cores.
  4. Provide a method for getting CPU core affinity for logical/physical core and performance/efficiency core combinations. This might be tricky since different operating systems might list the cores in different orders, but the gist would be to ask for the CPU affinity of something like just the physical performance cores to then assign to tasks or processes.

As for the topic of operating systems having updated schedulers to handle heterogeneous CPU core architectures, there can definitely be times where a user would want to keep their processes on just the performance cores, just the efficiency cores, or some custom combination (like one performance core and two efficiency cores). I myself have an app that benchmarks different implementations of the same code for comparison purposes, where each implementation runs on its own logical core. Previously it just used all cores for the benchmark, but with performance/efficiency cores that skews the results toward whichever implementation happens to land on a performance core.

dbwiddis commented 2 years ago
  1. Provide a method for getting CPU core affinity for logical/physical core and performance/efficiency core combinations. This might be tricky since different operating systems might list the cores in different orders, but the gist would be to ask for the CPU affinity of something like just the physical performance cores to then assign to tasks or processes.

To be more precise in terminology, "affinity" generally relates to processes being assigned to particular CPUs. I think "mask" or "bitmask" or "cpumask" is a better term to use: it is generally the argument when setting affinity.
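To illustrate the distinction, a cpumask is just the set of CPU numbers packed into an integer, one bit per logical CPU. The two helper names below are made up for this sketch.

```python
def cpus_to_mask(cpu_ids):
    """Pack an iterable of logical CPU numbers into an integer bitmask."""
    mask = 0
    for cpu in cpu_ids:
        mask |= 1 << cpu
    return mask


def mask_to_cpus(mask):
    """Unpack an integer bitmask into a sorted list of logical CPU numbers."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]
```

For example, CPUs 0, 2 and 4 correspond to the mask `0b10101`. An API could return either form; the list is friendlier in Python, and the mask is what lower-level affinity calls tend to expect.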

HunterAP23 commented 2 years ago

Good point, that is indeed what I meant and should've specified it better.

I think the hardest part of this whole thing is actually finding out whether a core is "performance" or "efficiency" in a cross-platform way. I'm not aware of any portion of Windows or Linux that exposes that information to the user, although I'm sure there are reliable ways to do this that I'm not aware of.

I was previously using pywin32 to get CPU information, but there weren't any specific attributes that give usable information on whether or not a given core is flagged as a performance or efficiency one.

dbwiddis commented 2 years ago

I'm not aware of any portion of Windows or Linux that exposes that information to the user, although I'm sure there are reliable ways to do this that I'm not aware of.

I described several ways in this comment and have implemented them (cross platform) in Java, links in other comments above.

L3337 commented 1 year ago

OP here. (new Github account)

Regarding the topic of thread affinity on heterogeneous CPUs, I recently learned that MacOS and Windows have these APIs for setting thread QoS that affects the decision of whether to run each thread on a performance or efficiency core.

MacOS Windows

I am not aware that Linux has a comparable API yet. But this seems to be the future of managing thread affinity, so enabling users to set thread affinity manually should probably not be a design goal for this feature.

dbwiddis commented 1 year ago

enabling users to set thread affinity manually should probably not be a design goal for this feature

Agreed. However reporting processor numbers and their correspondence to core types should be. I'll leave it to others to define the API but on my (Java-based) project I have enough objects/lists to construct an output like this:

Identifier: Apple Inc. Family 0x1b588bb3 Model 0 Stepping 0
ProcessorID: 0100000c1b588bb3
Microarchitecture: ARM64 SoC: Firestorm + Icestorm
 Topology:
  LogProc  P/E Proc  Pkg NUMA PGrp
        0    E    0    0    0    0
        1    E    1    0    0    0
        2    E    2    0    0    0
        3    E    3    0    0    0
        4    P    4    0    0    0
        5    P    5    0    0    0
        6    P    6    0    0    0
        7    P    7    0    0    0
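For the "processor numbers and their correspondence to core types" part, the table above reduces to a small mapping. The sketch below hard-codes the LogProc and P/E columns from that output; the function name is made up and no particular psutil API is implied.

```python
from collections import defaultdict

# (logical processor, core type) pairs copied from the Apple M1 table above.
TOPOLOGY = [(0, "E"), (1, "E"), (2, "E"), (3, "E"),
            (4, "P"), (5, "P"), (6, "P"), (7, "P")]


def logical_cpus_by_type(topology):
    """Group logical processor numbers by their core type label."""
    groups = defaultdict(list)
    for logical_cpu, core_type in topology:
        groups[core_type].append(logical_cpu)
    return dict(groups)
```

A caller who does want manual pinning could then pass the "P" list to whatever affinity call their platform offers, without the library itself having to set anything.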

HunterAP23 commented 1 year ago

I think having the ability to view and change both the specific core affinities as well as the thread QoS would both be beneficial, and would work better in conjunction with one another.

With just the thread QoS setting, you would only be able to see what QoS level an application is using at any given time. Having just the numbered thread/core affinity lets you do the same thing but without any inference as to what cores are considered "performance" vs "efficiency" ones, and would require looking at the individual core's efficiency class or other similar object that may or may not be present or easily accessible on other operating systems.

A combined approach would give you all the information you would need about the processor cores, and if someone is interested enough in manually setting core affinities down to the individual cores, they can look into implementing that on top of this.