obilaniu / libpfc

A small library and kernel module for easy access to x86 performance monitor counters under Linux.
MIT License

Support for architecture version 5 #27

Closed (goldsteinn closed 3 years ago)

goldsteinn commented 3 years ago

Hi,

Tried to use the module and got the following dmesg:

[   48.641085] pfc: Kernel Module loading on processor Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Family 6 (6), Model 126 (07E), Stepping 5 (5))
[   48.641087] pfc: cpuid.0x0.0x0:        EAX=0000001b, EBX=756e6547, ECX=6c65746e, EDX=49656e69
[   48.641088] pfc: cpuid.0x1.0x0:        EAX=000706e5, EBX=00100800, ECX=7ffafbbf, EDX=bfebfbff
[   48.641089] pfc: cpuid.0x6.0x0:        EAX=0017aff7, EBX=00000002, ECX=00000009, EDX=00000000
[   48.641091] pfc: cpuid.0xA.0x0:        EAX=08300805, EBX=00000000, ECX=0000000f, EDX=00008604
[   48.641092] pfc: cpuid.0x80000000.0x0: EAX=80000008, EBX=00000000, ECX=00000000, EDX=00000000
[   48.641093] pfc: cpuid.0x80000001.0x0: EAX=00000000, EBX=00000000, ECX=00000121, EDX=2c100800
[   48.641095] pfc: cpuid.0x80000002.0x0: EAX=65746e49, EBX=2952286c, ECX=726f4320, EDX=4d542865
[   48.641096] pfc: cpuid.0x80000003.0x0: EAX=37692029, EBX=3630312d, ECX=20374735, EDX=20555043
[   48.641097] pfc: cpuid.0x80000004.0x0: EAX=2e312040, EBX=48473033, ECX=0000007a, EDX=00000000
[   48.641098] pfc: ERROR: Unsupported performance monitoring architecture version 5, only 3 or 4 supported!
[   48.641098] pfc: ERROR: Failed to load module pfc.
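
For reference, the `cpuid.0xA.0x0` line in the log above carries the version number that the module rejects. The sketch below decodes the EAX value using the documented CPUID leaf 0xA field layout. The names `decode_pm_eax`, `counter_mask` and `pm_version_supported` are hypothetical and are not taken from libpfc; the version gate in particular is only an illustration of a widened check, not the module's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of CPUID leaf 0xA, register EAX (architectural perfmon). */
struct pm_eax {
    unsigned version;     /* bits  7:0  architectural perfmon version */
    unsigned gp_counters; /* bits 15:8  general-purpose counters      */
    unsigned gp_width;    /* bits 23:16 counter width in bits         */
};

static struct pm_eax decode_pm_eax(uint32_t eax)
{
    struct pm_eax f;
    f.version     =  eax        & 0xffu;
    f.gp_counters = (eax >>  8) & 0xffu;
    f.gp_width    = (eax >> 16) & 0xffu;
    return f;
}

/* Mask with the low `width` bits set, for width in 1..64. */
static uint64_t counter_mask(unsigned width)
{
    assert(width >= 1 && width <= 64);
    return width == 64 ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
}

/* Hypothetical gate: accept v3 through v5 instead of only v3 and v4. */
static int pm_version_supported(unsigned version)
{
    return version >= 3 && version <= 5;
}
```

Applied to `EAX=08300805`, this yields version 5 with 8 general-purpose counters of 48 bits each. Widening the gate only gets the module to load; it does not make the rest of the code aware of what changed in v5.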

Was able to get the module running just by changing https://github.com/obilaniu/libpfc/blob/master/kmod/pfckmod.c#L930 to include arch version 5, then ran the demo, and obviously there were some issues. The output of pfcdemo seemed fine, but dmesg told another story.

Here's the dmesg:

[  434.231577] pfc: Kernel Module loading on processor Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Family 6 (6), Model 126 (07E), Stepping 5 (5))
[  434.231580] pfc: cpuid.0x0.0x0:        EAX=0000001b, EBX=756e6547, ECX=6c65746e, EDX=49656e69
[  434.231581] pfc: cpuid.0x1.0x0:        EAX=000706e5, EBX=01100800, ECX=7ffafbbf, EDX=bfebfbff
[  434.231582] pfc: cpuid.0x6.0x0:        EAX=0017aff7, EBX=00000002, ECX=00000009, EDX=00000000
[  434.231583] pfc: cpuid.0xA.0x0:        EAX=08300805, EBX=00000000, ECX=0000000f, EDX=00008604
[  434.231584] pfc: cpuid.0x80000000.0x0: EAX=80000008, EBX=00000000, ECX=00000000, EDX=00000000
[  434.231586] pfc: cpuid.0x80000001.0x0: EAX=00000000, EBX=00000000, ECX=00000121, EDX=2c100800
[  434.231587] pfc: cpuid.0x80000002.0x0: EAX=65746e49, EBX=2952286c, ECX=726f4320, EDX=4d542865
[  434.231588] pfc: cpuid.0x80000003.0x0: EAX=37692029, EBX=3630312d, ECX=20374735, EDX=20555043
[  434.231589] pfc: cpuid.0x80000004.0x0: EAX=2e312040, EBX=48473033, ECX=0000007a, EDX=00000000
[  434.231590] pfc: PM Arch Version:      5
[  434.231591] pfc: Fixed-function  PMCs: 4 Mask 0000ffffffffffff (48 bits)
[  434.231593] pfc: General-purpose PMCs: 8 Mask 0000ffffffffffff (48 bits)
[  434.231627] pfc: Module pfc loaded successfully. Make sure to execute
[  434.231628] pfc:     modprobe -ar iTCO_wdt iTCO_vendor_support
[  434.231629] pfc:     echo 0 > /proc/sys/kernel/nmi_watchdog
[  434.231629] pfc: and blacklist iTCO_vendor_support and iTCO_wdt, since the CR4.PCE register
[  434.231630] pfc: initialization is periodically undone by an unknown agent.
[  769.301462] unchecked MSR access error: WRMSR to 0x38d (tried to write 0x0000000000006222) at rIP: 0xffffffffc0b235ab (pfcWRMSR+0x6b/0x2e0 [pfc])
[  769.301463] Call Trace:
[  769.301466]  pfcCfgWr+0x11a/0x360 [pfc]
[  769.301469]  ? _cond_resched+0x1a/0x50
[  769.301471]  sysfs_kf_bin_write+0x5c/0x70
[  769.301472]  kernfs_fop_write+0xda/0x1b0
[  769.301474]  vfs_write+0xc9/0x200
[  769.301475]  __x64_sys_pwrite64+0x93/0xc0
[  769.301477]  do_syscall_64+0x49/0xc0
[  769.301478]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  769.301479] RIP: 0033:0x7f2cb7e08d5a
[  769.301480] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[  769.301480] RSP: 002b:00007ffcaf02aba8 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[  769.301482] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2cb7e08d5a
[  769.301482] RDX: 0000000000000038 RSI: 00007ffcaf02ac60 RDI: 0000000000000003
[  769.301482] RBP: 00007ffcaf02abd0 R08: 0000000000000002 R09: 0000000000000000
[  769.301483] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e9e7c46220
[  769.301483] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  816.752386] pfc: Module exiting...
[  816.752509] pfc: Module exited.

Wondering:

1) Do you plan to add support for arch version 5? (I figure you have all the info you need from dmesg if you choose to do this. From the comments it looks like this was written for Skylake and earlier, and I'm on Ice Lake client, so maybe that's it. If so, it's probably not too much work to add support. If you're not particularly interested, I might try out a PR.)
2) How serious is this error message in dmesg, both in terms of the risk of kernel panics and in terms of the quality of the benchmarks?

Thanks.

obilaniu commented 3 years ago

@goldsteinn I wrote this small project for the purposes of some ultra-low-latency/overhead benchmarking on my own computer's CPU, an Intel Haswell microarchitecture processor (w/ perfmon v3). The intention was to bypass system calls by making the counters readable from userspace, and to configure them from inside the program too (but outside the benchmarked code).

Because there is seemingly no way to ask the Linux kernel's performance-monitoring subsystem what mapping of "events" to "counters" it applied, and there are important restrictions (not all events can be counted on every counter), I also gave this kernel module the ability to configure the counters itself. This configuration only really works on a single core (thus the pfcPinThread() call), but I had considered that acceptable because the microbenchmarking I had been doing is single-threaded performance anyways. Furthermore, there was no easy way to migrate counter values with the process when it is rescheduled to a different core except by using the Linux kernel's performance-monitoring subsystem.

The fact that this code configures the counters itself had several important consequences. It needs detailed knowledge of, and read/write access to, several CPU MSRs (model-specific registers), in order to safely program the perfmon subsystem. By delving deep into the Intel SDM, I was able to write that code for my own (Intel Haswell) CPU. The MSR addresses I write to are hardcoded in libpfcmsr.h, and the table of events and their codes for Haswell is in libpfc.c.

Every change of the perfmon architecture version puts the MSR numbering and event codes in danger. As it turns out the v4 was still somewhat backwards compatible, so I allowed it. But v5 seems to have broken it (as in here).

I now realize that the approach I took (baking in the constants for Haswell as C source code) was never going to scale in the long term. It was enormously labour-intensive, but worthwhile when I was working only on my own CPU. Scaling it to more CPU microarchitectures means writing CPU-detection code and more and more logic for every CPU out there. I can't do it, nor test it, when I only have one CPU model available and limited time.

If I had to do this project again, I would probably use libpfm, as suggested by Travis Downs, or maybe JSON/YAML files with the counters and events in declarative form. Or maybe I would simply accept using perf, PAPI, and other high-overhead APIs, but with very high loop trip counts.


As a post-script, there is a further complication you might be hitting here. The trend within the Linux kernel is, more and more, to control and lock things down. This includes MSRs, the access to which in recent Linux kernel releases has been tightened. In general, performance monitoring components are also security risks because they're side-channels through which valuable information can leak (e.g. speculative data fetch/control flow, Meltdown, Spectre, ...).

So I am being strangled by newer Linux kernels, which would rather not give low-overhead, unprivileged, read/write access to performance counters and their configuration. I can't easily fight that. These days, I usually use perf. This project is in some sense a vestige, a futile effort in light of the clear will of Linux kernel developers to force the usage of their entry points. But as a non-trivial Linux kernel module + userspace program pair, it still has some educational value, for others as well as myself, so I leave it online. It's my most-starred open-source project, for some reason.

goldsteinn commented 3 years ago

Ahh. Bummer, this was a great project. I learned a lot from going over the source code (ended up writing a stupider version in userspace, with perf_event_open to do all the MSR stuff and assembly for the critical sections).

The manual says I should be able to get v3/v4 to work if need be:

Processors supporting architectural performance monitoring version 5 also support versions 1, 2, 3 and 4

So I can probably get it to work if need be.

obilaniu commented 3 years ago

@goldsteinn If the manual says that v5 is compatible with v3/v4, then definitely have a look at my functions for reading/writing MSRs. It's either that the Linux kernel is intercepting calls to its wrapper, or that my MSR addresses are wrong.
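
One way to narrow this down is to decode the value from the failing write in the log (`WRMSR to 0x38d`, value `0x6222`). The kernel prints "unchecked MSR access error" when a WRMSR faults and it recovers from the fault. Address 0x38d is the architectural IA32_FIXED_CTR_CTRL register, which holds one 4-bit control field per fixed-function counter. The sketch below decodes those fields; `decode_fixed_ctrl` and `struct fixed_ctrl` are hypothetical names, not part of libpfc.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural address of IA32_FIXED_CTR_CTRL, the MSR named in the log. */
#define IA32_FIXED_CTR_CTRL 0x38du

/* Control bits for one fixed-function counter (one 4-bit field each). */
struct fixed_ctrl {
    unsigned os;        /* bit 0: count while in ring 0          */
    unsigned usr;       /* bit 1: count while in rings 1..3      */
    unsigned anythread; /* bit 2: AnyThread                      */
    unsigned pmi;       /* bit 3: raise an interrupt on overflow */
};

/* Extract the control field of fixed counter `idx` from an MSR value. */
static struct fixed_ctrl decode_fixed_ctrl(uint64_t value, unsigned idx)
{
    assert(idx < 16); /* 16 four-bit fields fit in 64 bits */
    unsigned field = (unsigned)(value >> (4 * idx)) & 0xfu;
    struct fixed_ctrl c;
    c.os        =  field       & 1u;
    c.usr       = (field >> 1) & 1u;
    c.anythread = (field >> 2) & 1u;
    c.pmi       = (field >> 3) & 1u;
    return c;
}
```

For `0x6222`, counters 0 to 2 decode to user-mode counting only, while the field for counter 3 also has the AnyThread bit set. Since CPUID.0xA reports `EDX=00008604` (bit 15, AnyThread deprecation, is set), that bit is a plausible suspect for the fault. This is an assumption drawn from the log alone and has not been confirmed on hardware.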

goldsteinn commented 3 years ago

@obilaniu ahh. I was thinking I could roll back my arch version or something. Looking at the MSR numbers I noticed a few differences (with the temperature demo, for example, MSR_CORE_PERF_LIMIT_REASONS is now 0x64F, it seems). If I write a PR for multi-arch support, would you consider looking at it, or are you absolutely done maintaining?

edit: Rollback -> set some configuration to use another version the hardware is compatible with

obilaniu commented 3 years ago

I would consider it. Do you have an idea if the renumbering of MSRs is common to all perfmon v5 processors, or is it particular to Ice Lake client?

I figure that either way, there is now going to be a need for runtime CPU detection (both at the user and kernel level), and the fixed MSR address macros will have to go the way of the dodo.
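
A minimal sketch of what such runtime detection could look like, assuming a per-model lookup table replaces the fixed macros. It decodes the CPUID.1 EAX signature from the logs above and looks up one MSR address. All names here are hypothetical. The Ice Lake entry uses the 0x64F value reported earlier in this thread; the Haswell model number and address are assumptions that should be checked against the SDM tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_sig { unsigned family, model, stepping; };

/* Decode CPUID.1:EAX into display family, display model and stepping. */
static struct cpu_sig decode_cpu_sig(uint32_t eax)
{
    unsigned base_family = (eax >>  8) & 0xfu;
    unsigned base_model  = (eax >>  4) & 0xfu;
    unsigned ext_family  = (eax >> 20) & 0xffu;
    unsigned ext_model   = (eax >> 16) & 0xfu;
    struct cpu_sig s;
    s.stepping = eax & 0xfu;
    /* The extended fields only apply to certain base family values. */
    s.family = (base_family == 0xfu) ? base_family + ext_family : base_family;
    s.model  = (base_family == 0x6u || base_family == 0xfu)
             ? (ext_model << 4) | base_model
             : base_model;
    return s;
}

/* Hypothetical per-model table replacing a fixed address macro. */
struct msr_map { unsigned family, model; uint32_t core_perf_limit_reasons; };

static const struct msr_map msr_table[] = {
    { 6, 0x3c, 0x690 }, /* Haswell client: ASSUMED values, verify in the SDM */
    { 6, 0x7e, 0x64f }, /* Ice Lake client: address reported in this thread  */
};

/* Returns 1 and stores the address if the CPU is in the table, else 0. */
static int lookup_core_perf_limit_reasons(struct cpu_sig s, uint32_t *addr)
{
    assert(addr != NULL);
    for (size_t i = 0; i < sizeof msr_table / sizeof msr_table[0]; i++) {
        if (msr_table[i].family == s.family && msr_table[i].model == s.model) {
            *addr = msr_table[i].core_perf_limit_reasons;
            return 1;
        }
    }
    return 0; /* unknown CPU: the caller should refuse to touch MSRs */
}
```

With `EAX=000706e5` from the logs, the decode gives family 6, model 126 (0x7E), stepping 5, matching the first dmesg line. Returning failure for unknown models keeps the module from writing to addresses it has no table entry for.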

goldsteinn commented 3 years ago

@obilaniu I think it is processor-specific (not tied to the arch version). The changes to the arch version were pretty minimal.

I think you use this table:

MSRS IN THE 4TH GENERATION INTEL® CORE™ PROCESSORS (BASED ON HASWELL MICROARCHITECTURE)

I think it just needs to be updated to include the new tables (there are a fair number of them). libpfm doesn't seem to provide MSR info. Do you know of any packages that do?

The arch version changes from 4 -> 5 are just:

Processors supporting architectural performance monitoring version 5 also support versions 1, 2, 3 and 4, as well as capability enumerated by CPUID leaf 0AH. Specifically, version 5 provides the following enhancements:

• Deprecation of Anythread mode, see Section 18.2.5.1.
• Individual enumeration of Fixed counters in CPUID.0AH, see Section 18.2.5.2.
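
Both enhancements are visible in the CPUID.0xA values from the logs (`ECX=0000000f`, `EDX=00008604`). The sketch below decodes them using the documented leaf 0xA layout, in which a fixed counter `i` is supported if bit `i` of ECX is set or if the count in EDX[4:0] is greater than `i`. The function and struct names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-counter fields of CPUID leaf 0xA, register EDX. */
struct pm_edx {
    unsigned fixed_counters;       /* bits  4:0  contiguous fixed counters */
    unsigned fixed_width;          /* bits 12:5  counter width in bits     */
    unsigned anythread_deprecated; /* bit  15    set on v5 parts that
                                      deprecate AnyThread mode             */
};

static struct pm_edx decode_pm_edx(uint32_t edx)
{
    struct pm_edx f;
    f.fixed_counters       =  edx        & 0x1fu;
    f.fixed_width          = (edx >>  5) & 0xffu;
    f.anythread_deprecated = (edx >> 15) & 0x1u;
    return f;
}

/* v5 rule: counter i exists if ECX[i] is set or EDX[4:0] > i. */
static int fixed_counter_supported(uint32_t ecx, uint32_t edx, unsigned i)
{
    assert(i < 32); /* ECX is a 32-bit bitmap */
    return ((ecx >> i) & 1u) || ((edx & 0x1fu) > i);
}
```

For the values in this thread, the decode reports 4 fixed counters of 48 bits each with AnyThread deprecated, and counters 0 through 3 are individually enumerated. Code written for v3/v4 that sets AnyThread bits or assumes three fixed counters would need to consult these fields first.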