Open jdmccalpin opened 6 years ago
At least for newer CPUs, the "official" way to get the TSC frequency is through cpuid leaf 0x15. You can see the code I use here for the cpuid approach.
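As a rough illustration of the arithmetic behind the leaf 0x15 approach (this is not the code linked above): the leaf returns the denominator of the TSC-to-crystal ratio in EAX, the numerator in EBX, and the crystal frequency in Hz in ECX, which may be zero. The sketch takes the raw register values as arguments because Python cannot execute cpuid itself. The function name and the `fallback_crystal_hz` parameter are made up for this example, and the register values in the usage line are illustrative, not measured.

```python
def tsc_hz_from_leaf15(eax, ebx, ecx, fallback_crystal_hz=None):
    """Return the TSC frequency in Hz implied by CPUID leaf 0x15, or None.

    eax: denominator of the TSC / crystal-clock ratio
    ebx: numerator of the TSC / crystal-clock ratio
    ecx: crystal clock frequency in Hz (0 when the CPU does not report it)
    fallback_crystal_hz: hypothetical caller-supplied value used when ecx == 0
    """
    if eax == 0 or ebx == 0:
        return None  # ratio not enumerated on this CPU
    crystal_hz = ecx if ecx != 0 else fallback_crystal_hz
    if not crystal_hz:
        return None  # crystal frequency unknown and no fallback given
    return crystal_hz * ebx // eax


# Illustrative values only: ratio 216/2, ECX reads as 0, 24 MHz crystal assumed.
print(tsc_hz_from_leaf15(2, 216, 0, fallback_crystal_hz=24_000_000))  # 2592000000
```

Returning None instead of guessing keeps the caller free to fall back to a calibration loop.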
Other than that, I think the best approach is to use a calibration loop - here's the one I use although it could perhaps be improved. Still, it usually gets results which are stable to 4 or 5 significant figures.
Note that a calibration loop will work automatically for AMD also, so it is applicable to this issue.
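The idea of a calibration loop can be sketched as follows. This is not the loop referred to above, only a minimal illustration: sample a tick counter and a wall clock before and after a short wait, then divide. Python has no portable way to execute RDTSC, so `read_counter` is a stand-in that the caller must supply; that parameter, the function name, and the default interval are all assumptions of this sketch.

```python
import time


def calibrate_ticks_per_second(read_counter, interval_s=0.1,
                               sleep=time.sleep, clock_ns=time.monotonic_ns):
    """Estimate the rate of `read_counter` in ticks per second.

    read_counter: hypothetical callable returning the current tick count
                  (a stand-in for reading the TSC)
    sleep, clock_ns: injectable so the function can be exercised with a
                     simulated clock
    """
    t0 = clock_ns()
    c0 = read_counter()
    sleep(interval_s)
    c1 = read_counter()
    t1 = clock_ns()
    elapsed_ns = t1 - t0
    if elapsed_ns <= 0:
        raise RuntimeError("reference clock did not advance")
    return (c1 - c0) * 1_000_000_000 / elapsed_ns


# Usage with the OS nanosecond counter standing in for the TSC:
print(calibrate_ticks_per_second(time.perf_counter_ns, interval_s=0.05))
```

A longer interval, or the median of several runs, would reduce the influence of the time spent between reading the clock and reading the counter.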
There is an argument to be made that the calibration loop is actually the best of all approaches, since modern cores often run at slightly different frequencies than nominal, due to what I believe is spread-spectrum clocking for EMI reduction, which lowers the frequency by somewhere between 0.25% and 1%. So the calibration approach can beat even the "perfect" approach of using cpuid or reading the brand string (note that on recent CPUs reading the brand string doesn't even give the exact result, since the TSC is decoupled from the CPU base clock; for example, the TSC might run at 2952 MHz while the CPU runs at 2600 MHz nominal).
Len Brown from Intel, who keeps turbostat up to date on Linux, in fact recently changed the TSC frequency code for SKX to stop using cpuid and start using a loop. This approach was also adopted for the kernel.
Thanks for the update! Reviewing my historical data, it looks like this CPUID feature (which is long overdue) first appears in SKX. (I don't have any Skylake Client parts to check.) Even on SKX, the values are incomplete (i.e., no specification of the "core crystal clock frequency"), and I am certainly going to ding them points for not clearly pointing out that the "core crystal clock frequency" is different than the "bus (reference) frequency". This mess is repeated in Intel's irritatingly inconsistent implementation of the event CPU_CLK_THREAD_UNHALTED.REF_XCLK (and its historical variants).
I remember a discussion about the Skylake client parts having a 24 MHz core crystal clock frequency, which sounds horrible. I have not had any trouble with inconsistency between clocks on my SKX processors, but the "increment by 84" of the fixed-function "reference cycles not halted" counter fuzzes up the calculations....
I don't recall if I included this in the comments in the code, but the brand string approach was quite inexact for Nehalem/Westmere processors, since they used a 133 MHz base clock, but the brand string only gave the frequency to 0.1 GHz, so you had to recognize that you were on an old system and compensate. We retired our last Westmere cluster in 2016, I think, so it has not been a problem....
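One way such a compensation could look, assuming the true frequency is an integer multiple of a 400/3 MHz ("133 MHz") base clock: round the brand-string value to the nearest multiple. This is a guess at the idea, not the code discussed above, and the function name is hypothetical.

```python
from fractions import Fraction

# Assumption: the "133 MHz" base clock is exactly 400/3 MHz.
BCLK_MHZ = Fraction(400, 3)


def snap_to_bclk(brand_ghz, bclk_mhz=BCLK_MHZ):
    """Round a brand-string frequency (in GHz) to the nearest base-clock multiple.

    Returns the snapped frequency in MHz as a float.
    """
    brand_mhz = Fraction(str(brand_ghz)) * 1000
    multiplier = round(brand_mhz / bclk_mhz)  # nearest integer multiplier
    return float(multiplier * bclk_mhz)


# A brand string reporting 2.9 GHz snaps to 22 * 133.33 MHz.
print(round(snap_to_bclk(2.9), 2))  # 2933.33
```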
@jdmccalpin - the cpuid technique appears on SKL; I am using it there today.
For the crystal frequencies, yes, there is no specific way to get them from cpuid, which is indeed stupid. However, you can look at the turbostat.c source, which maps from family/model to crystal clock frequency. This file is generally kept quite up to date by Intel. Yes, Skylake client has a 24 MHz crystal frequency, but I guess it is not too terrible once you know about it.
For AMD Family 11h and newer the TSC increments at the "P0" speed (maximum frequency) that the core supports. There is a set of MSRs, one for each P-state, that can be decoded to find the frequency: Core::Msr::PStateDef (starting at 0xc0010064). Family 17h (Zen, Zen+, Zen 2) and 19h (Zen 3, Zen 3+, Zen 4) use the same encoding of the bits -- a table lookup in bits 13:8 to determine the divisor (value and interpretation) and a number in bits 7:0 that is multiplied by 25 MHz. Unfortunately the MSR read requires kernel access, which is outside of the scope of the low-overhead-timers project. AMD recommends running "cpupower frequency-info". Running that as root on an AMD 7763 (Milan) system gives:
# cpupower frequency-info
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
CPUs which run at the same hardware frequency: Not Available
CPUs which need to have their frequency coordinated by software: Not Available
maximum transition latency: Cannot determine or is not supported.
Not Available
available cpufreq governors: Not Available
Unable to determine current policy
current CPU frequency: Unable to call hardware
current CPU frequency: Unable to call to kernel
boost state support:
Supported: yes
Active: yes
Boost States: 0
Total States: 3
Pstate-P0: 2450MHz
Pstate-P1: 2000MHz
Pstate-P2: 1500MHz
Reading MSR 0xc0010064 (PStateDef for P0) gives a divisor of 1 and a multiplier of 0x62 (98 decimal), which corresponds to 98 * 25 MHz = 2450 MHz. A quick sanity check, reading the TSC (MSR 0x10) twice one second apart, agrees:
# ~mccalpin/bin/rdmsr -p 0 -u 0x10; sleep 1; ~mccalpin/bin/rdmsr -p 0 -u 0x10
30760512739154849
30760515191035837
# echo $(( 30760515191035837 - 30760512739154849 ))
2451880988 # 2,451,880,988 Hz
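The decoding described above can be sketched in a few lines. This is a simplification, not the Bash script or table lookup mentioned in this thread: it assumes the divisor encoded in bits 13:8 is simply the field value divided by 8, which matches AMD's published formula as I understand it (frequency = 200 MHz * FID / DID) but does not reject reserved encodings. The function name is hypothetical, and the raw value below is constructed for illustration with only the two fields set.

```python
def decode_pstate_def_mhz(msr_value):
    """Decode the core frequency in MHz from a Family 17h/19h PStateDef value.

    Assumption: divisor = DID / 8, so MHz = FID * 25 / (DID / 8) = 200 * FID / DID.
    Reserved DID encodings are not checked.
    """
    fid = msr_value & 0xFF         # bits 7:0, multiplied by 25 MHz
    did = (msr_value >> 8) & 0x3F  # bits 13:8, encodes the divisor
    if did == 0:
        return None                # divider reported as off
    return 200.0 * fid / did


# Hypothetical raw value: DID = 8 (divisor 1), FID = 0x62 (98 decimal).
raw = (0x08 << 8) | 0x62
print(decode_pstate_def_mhz(raw))  # 2450.0

# Compare with the one-second TSC delta measured above.
measured_hz = 30760515191035837 - 30760512739154849
print(measured_hz)                 # 2451880988
print(f"{(measured_hz - 2_450_000_000) / 2_450_000_000:.2%}")  # 0.08%
```

The small excess over 2450 MHz is consistent with the interval being slightly longer than one second because of the time taken to launch the commands.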
I have to say that it qualifies as genuine stupid to design a system that requires root privileges to determine something as simple as the invariant TSC frequency.
For low-overhead-timers on AMD systems, I think I will just require that the user figure out the TSC frequency some other way and pass it in an environment variable.
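A minimal sketch of the environment-variable idea, for illustration only. The variable name `TSC_FREQ_HZ` is hypothetical; the project does not define it in this thread.

```python
import math
import os


def tsc_hz_from_env(environ=os.environ, name="TSC_FREQ_HZ"):
    """Return the user-supplied TSC frequency in Hz, or None if absent or invalid.

    `name` is a hypothetical variable name chosen for this example.
    """
    raw = environ.get(name)
    if raw is None:
        return None
    try:
        hz = float(raw)
    except ValueError:
        return None
    # Reject zero, negative, NaN, and infinite values.
    return hz if math.isfinite(hz) and hz > 0 else None


# Example with an explicit mapping instead of the real environment.
print(tsc_hz_from_env({"TSC_FREQ_HZ": "2450000000"}))  # 2450000000.0
```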
One more caveat -- the recent AMD processors have an additional facility to rescale the TSC frequency reported by this MSR if the MSR read comes from a guest OS. This allows the hypervisor to re-scale the reported TSC frequency if the guest is moved from a CPU set with one TSC frequency to a CPU set with a different TSC frequency. Dunno if this means that the hypervisor is also scaling the values returned by the RDTSC(P) instructions -- virtualized systems are "not my problem" (TM).
> I have to say that it qualifies as genuine stupid to design a system that requires root privileges to determine something as simple as the invariant TSC frequency.
Yes, it is very unfortunate. Arguably the kernel should expose this value in a standard way readable by user space. One way you can get these values is the time_offset and related values in the "user page" approach to reading performance counters offered by perf_event_open (see man page). However, even doing that requires certain privileges, like a suitable value of /proc/sys/kernel/perf_event_paranoid.
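To make the "related values" concrete: the perf_event_open man page describes converting a cycle count to nanoseconds with the `time_mult` and `time_shift` fields of the mmap'ed page, roughly ns = time_offset + (cycles * time_mult) >> time_shift. Under that reading, the implied counter frequency is 1e9 * 2^time_shift / time_mult. The sketch below only does this arithmetic; it does not open a perf event, the function name is hypothetical, and the field values in the example are invented to correspond to a 2450 MHz clock.

```python
def tsc_hz_from_perf_page(time_mult, time_shift):
    """Frequency in Hz implied by perf's cycles-to-nanoseconds scaling.

    One cycle corresponds to time_mult / 2**time_shift nanoseconds, so the
    frequency is the reciprocal of that, scaled to seconds.
    """
    if time_mult <= 0:
        raise ValueError("time_mult must be positive")
    return 1e9 * (1 << time_shift) / time_mult


# Invented example values: time_shift = 31, time_mult chosen for about 2450 MHz.
example_mult = round((1 << 31) * 1e9 / 2.45e9)
print(tsc_hz_from_perf_page(example_mult, 31))
```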
For the three processors I have tested so far, the P0 frequency is equal to the marketing base frequency. I wrote a little Bash script to read the MSRs and compute the frequencies, and once I built a table to handle the divisors properly the results agree with "cpupower frequency-info", and match the "Base Clock" values from the AMD web site.
The three processors I tested were:
The values reported by /proc/cpuinfo or by reading the TSC before and after an interval are not exact matches to the nominal/marketing/base frequency, but they are fine for low-overhead timers. For the EPYC 7763 (2450 MHz), cpuinfo reports a value within about 0.2%:
$ TSC_MHZ=`grep "^cpu MHz" /proc/cpuinfo | head -1 | awk '{print $4}'`
$ echo $TSC_MHZ
2445.321
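The same lookup, written as a small function that works on the text of /proc/cpuinfo so it can be tried on a sample string. The function name is hypothetical and the sample text is trimmed down to the two lines that matter.

```python
def first_cpu_mhz(cpuinfo_text):
    """Return the first "cpu MHz" value found in /proc/cpuinfo text, or None."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("cpu MHz"):
            # Lines look like "cpu MHz<tabs>: 2445.321"
            return float(line.split(":", 1)[1])
    return None


sample = "processor\t: 0\ncpu MHz\t\t: 2445.321\n"
mhz = first_cpu_mhz(sample)
print(mhz)                                  # 2445.321
print(f"{(2450.0 - mhz) / 2450.0:.2%}")     # 0.19%
```

On a live Linux system the text would come from `open("/proc/cpuinfo").read()`.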
The function "get_TSC_frequency()" uses a horrible hack to get the nominal CPU frequency from the CPUID brand string. The code is a port of Intel's reference code (from C++ to C), so this looks like a "correct" way of getting the data without depending on either privileged instructions (e.g., reading an MSR) or specific formats in files maintained by the OS (e.g., /proc/cpuinfo on Linux). I need an alternative that will work with AMD processors, but it has been too long since I have worked on them to remember if there is an easy way to do this....
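For readers unfamiliar with the brand-string trick, here is a small sketch of the parsing step only. It is not a port of get_TSC_frequency() or of Intel's reference code; the function name is hypothetical, and the two brand strings are examples of the usual Intel and AMD formats. The second one shows why this route fails on AMD: the string carries no frequency.

```python
import re

# Matches e.g. "2.60GHz" or "800 MHz" inside a CPUID brand string.
_FREQ_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([MGT])Hz", re.IGNORECASE)
_SCALE = {"M": 1e6, "G": 1e9, "T": 1e12}


def brand_string_hz(brand):
    """Return the frequency in Hz advertised in a brand string, or None."""
    match = _FREQ_RE.search(brand)
    if match is None:
        return None
    return float(match.group(1)) * _SCALE[match.group(2).upper()]


print(brand_string_hz("Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz"))  # 2600000000.0
print(brand_string_hz("AMD EPYC 7763 64-Core Processor"))            # None
```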