Open darkk opened 1 month ago
Sounds good, I'll test. Thanks.
Thanks! I'm looking forward to your comments and/or approval to continue submitting a few more chunks from my patch stack :-)
Also as branch darkk-cpu-mhz for testing on more machines
:+1:
Would it be useful if I upload results from my machines to this PR as well?
With /proc/cpuinfo reporting 3400 MHz, it calculates 2712. I'd rather read /proc/cpuinfo first and only fall back to this method if that fails.
And on a bigger machine of mine at 3550 MHz (AMD Ryzen 9 7950X3D 16-Core) it says: WARNING: timer resolution is 84 (0x54) ticks (0x362a3e8e0a729e - 0x362a3e8e0a724a). Broken VDSO? and calculates 4200.
However, do I understand correctly that this is exactly the same `rdtsc()` that is used to calculate cycles/hash, and that it has been a constant-rate counter since circa Pentium 4 and no longer depends on CPU frequency?
No, look closely at how we perform the timer start and stop. It's better than a mere rdtsc.
> It's better than a mere rdtsc
That's true, it implements barriers and that's important.
However, as far as I can see, it still uses the same 64-bit MSR read by `RDTSC` / `RDTSCP`, and the counter might tick at a constant rate. It seems that's exactly what the `constant_tsc` flag means.
I'm looking at the Intel 64 and IA-32 Architectures Software Developer's Manual v083, section 19.17 TIME-STAMP COUNTER in volume 3A (page 3763 in the PDF). It states the following:
> For Pentium 4 processors, Intel Xeon processors (…); for Intel Core Solo and Intel Core Duo processors (…); for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors (…); for Intel Core 2 and Intel Xeon processors (…); for Intel Atom processors (…): the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency, see Section 21.7.2 for more detail. On certain processors, the TSC frequency may not be the same as the frequency in the brand string. The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.
As far as I understand, it's quite different from cycle counter coming from performance registers. Am I getting it wrong and/or looking at the wrong place altogether?
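To make the distinction concrete, here is a minimal sketch of a fenced TSC read. It is not SMhasher's actual timer code: `FencedTicks` is a hypothetical name, the `lfence` placement follows the commonly recommended pattern for ordering `RDTSC`, and the non-x86-64 branch is only a portability fallback. The fences affect when the counter is read relative to neighbouring instructions; they do not change the rate at which it ticks, so on a `constant_tsc` CPU the difference between two reads is in TSC ticks rather than core clock cycles.

```cpp
#include <chrono>
#include <cstdint>
#if defined(__x86_64__)
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang)
#endif

// Hypothetical helper: read a tick counter with ordering fences around it.
std::uint64_t FencedTicks() {
#if defined(__x86_64__)
    _mm_lfence();                           // let earlier instructions finish first
    const std::uint64_t ticks = __rdtsc();  // same counter a plain RDTSC reads
    _mm_lfence();                           // keep later instructions behind the read
    return ticks;
#else
    // Assumption: no TSC available here, so fall back to a monotonic clock.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

Calling `FencedTicks()` before and after a hash call and subtracting gives an elapsed tick count. Whether that equals CPU cycles depends on whether the TSC follows the core clock on the machine in question.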
I've tested the branch on `Intel(R) Pentium(R) M processor 1.50GHz (family: 0x6, model: 0xd, stepping: 0x8)` running `Linux version 5.15.162 (builder@buildhost) (i486-openwrt-linux-musl-gcc (OpenWrt GCC 12.3.0 r24012-d8dd03c46f) 12.3.0, GNU ld (GNU Binutils) 2.40.0) #0 Mon Jul 15 22:14:18 2024`.
It's an old Intel CPU from the era of `ia-32-ia-64-benchmark-code-execution-paper.pdf`. It has no `constant_tsc` feature, and the branch works correctly under that assumption. This suggests that the suspicious-looking MHz value on a modern CPU is probably related to `constant_tsc` and is not a regression.
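On Linux, the presence of the flag can be checked from the `flags` line of /proc/cpuinfo. The sketch below is only an illustration: `HasCpuFlag` is a hypothetical helper, not something from the branch, and it assumes the x86 layout where the line starts with `flags`. It compares whole tokens, which matters here because the Pentium M reports `tsc` but not `constant_tsc`.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true if `flag` appears as a whole token on the first
// "flags" line of /proc/cpuinfo-style text.
bool HasCpuFlag(const std::string& cpuinfo_text, const std::string& flag) {
    std::istringstream lines(cpuinfo_text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, 5, "flags") != 0) continue;
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::string token;
        while (tokens >> token) {
            if (token == flag) return true;  // exact match, not a substring match
        }
        return false;  // only the first CPU's flags line is inspected
    }
    return false;
}
```

Fed the contents of /proc/cpuinfo, `HasCpuFlag(text, "constant_tsc")` would be one way to decide whether a TSC-derived MHz figure can be expected to track the current core frequency.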
I've tuned `/sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq` and `scaling_max_freq` and got the expected results:
| `scaling_*_freq`, min/max (kHz) | Hash | Throughput @ measured MHz |
|---|---|---|
| 600 000 / 1 500 000 | wyhash32 | 991.38 MiB/sec @ 1493 MHz |
| 600 000 / 1 500 000 | donothing32 | 7455000.00 MiB/sec @ 1491 MHz |
| 600 000 / 1 500 000 | donothing64 | 7450000.00 MiB/sec @ 1490 MHz |
| 600 000 | donothing64 | 2990000.00 MiB/sec @ 598 MHz |
| 900 000 | donothing64 | 4490000.00 MiB/sec @ 898 MHz |
| 1 100 000 | donothing32 | 5485000.00 MiB/sec @ 1097 MHz |
| 1 100 000 | wyhash32 | 658.72 MiB/sec @ 1097 MHz |
| 1 500 000 | donothing64 | 7480000.00 MiB/sec @ 1496 MHz |
| 1 500 000 | sha2-256 | 47.49 MiB/sec @ 1496 MHz |
I assume the frequency fluctuates a bit in the 600/1500 case because dynamic scaling kicks in at startup. The measurement is quite stable across runs in the other cases.
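As a quick sanity check on the table, the deviation of each measured value from the configured limit can be computed as below. `DeviationPercent` is a hypothetical helper written for this comment, and it assumes the sysfs `scaling_*_freq` values are in kHz. For the rows where a single frequency is pinned (598, 898, 1097 and 1496 MHz), the measured value comes out within about 0.35% of the configured one.

```cpp
// Hypothetical helper: percent deviation of a measured frequency (MHz) from a
// cpufreq limit as written to scaling_min_freq / scaling_max_freq (kHz).
double DeviationPercent(double measured_mhz, long configured_khz) {
    const double configured_mhz = configured_khz / 1000.0;
    return (measured_mhz - configured_mhz) / configured_mhz * 100.0;
}
```

For example, `DeviationPercent(1496.0, 1500000)` is roughly -0.27, and `DeviationPercent(598.0, 600000)` is roughly -0.33.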
`dmidecode`:
Processor Information
Socket Designation: None
Type: Central Processor
Family: Pentium M
Manufacturer: GenuineIntel
ID: D8 06 00 00 FF FB E9 AF
Signature: Type 0, Family 6, Model 13, Stepping 8
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
CLFSH (CLFLUSH instruction supported)
DS (Debug store)
ACPI (ACPI supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
SS (Self-snoop)
TM (Thermal monitor supported)
PBE (Pending break enabled)
Version: Intel(R) Pentium(R) M processor
Voltage: 1.1 V
External Clock: 400 MHz
Max Speed: 1500 MHz
Current Speed: 1500 MHz
Status: Populated, Enabled
Upgrade: None
L1 Cache Handle: 0x000A
L2 Cache Handle: 0x000B
L3 Cache Handle: Not Provided
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
`/proc/cpuinfo`:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.50GHz
stepping : 8
microcode : 0x20
cpu MHz : 600.000
cache size : 2048 KB
fdiv_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx bts cpuid est tm2
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 1197.13
clflush size : 64
cache_alignment : 64
address sizes : 32 bits physical, 32 bits virtual
power management:
@rurban, please advise: how should I proceed with this branch?
I'm thinking of integrating djb's `libcpucycles` into SMhasher to get more accurate CPU cycle measurements on modern CPUs, but it'll make the diff somewhat larger and harder to review.
As I said: read from /proc/cpuinfo first and use your method as the fallback. Or use djb's -lcpucycles, which also looks good.
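A minimal sketch of that order of preference, assuming the caller has already read /proc/cpuinfo into a string. `ParseCpuinfoMHz` and `CpuMHzWithFallback` are hypothetical names, not existing SMhasher functions, and `measure` stands in for whatever measurement-based estimate the branch provides. Note that `cpu MHz` is the kernel's view of the current frequency: the Pentium M dump above shows 600.000 on a CPU whose maximum is 1500 MHz.

```cpp
#include <functional>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: first positive "cpu MHz" value in /proc/cpuinfo-style
// text, or 0.0 if there is none.
double ParseCpuinfoMHz(const std::string& cpuinfo_text) {
    std::istringstream lines(cpuinfo_text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, 7, "cpu MHz") != 0) continue;
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        try {
            const double mhz = std::stod(line.substr(colon + 1));
            if (mhz > 0.0) return mhz;
        } catch (const std::logic_error&) {
            // std::stod could not parse a usable number; keep scanning.
        }
    }
    return 0.0;
}

// Hypothetical policy: prefer the reported value, otherwise call the fallback.
double CpuMHzWithFallback(const std::string& cpuinfo_text,
                          const std::function<double()>& measure) {
    const double reported = ParseCpuinfoMHz(cpuinfo_text);
    return reported > 0.0 ? reported : measure();
}
```

On platforms where /proc/cpuinfo has no `cpu MHz` line, the parser returns 0.0 and the fallback is used.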
It makes timer accounting issues like #241 more visible.
The overall logic is to sample (wall-clock-ns, cycle-counter) pairs and take the longest possible valid interval out of 2,999 SpeedTest() calls during `--test=SpeedBulk`. The code has to provide reasonable MHz estimates accounting for:
- `donothing32`
- `sha3-256`
- `GetCpuFreqMHz()`
That's why the code samples the pairs on every SpeedTest() call and not just twice in some "good" places.
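The idea of picking the longest valid interval can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the branch: `Sample` and `EstimateMHz` are hypothetical names, and "valid" is assumed here to mean that both the wall clock and the cycle counter strictly advanced between the two samples. The quadratic scan is kept for clarity and is cheap for a few thousand samples.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One (wall-clock, cycle-counter) reading; field names are illustrative.
struct Sample {
    std::uint64_t wall_ns;
    std::uint64_t cycles;
};

// Estimate MHz from the widest pair of samples for which both clocks strictly
// advanced. Returns 0.0 if no such pair exists.
double EstimateMHz(const std::vector<Sample>& samples) {
    std::uint64_t best_ns = 0;
    std::uint64_t best_cycles = 0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        for (std::size_t j = i + 1; j < samples.size(); ++j) {
            const Sample& a = samples[i];
            const Sample& b = samples[j];
            // Skip pairs where either clock stalled or went backwards.
            if (b.wall_ns <= a.wall_ns || b.cycles <= a.cycles) continue;
            const std::uint64_t ns = b.wall_ns - a.wall_ns;
            if (ns > best_ns) {
                best_ns = ns;
                best_cycles = b.cycles - a.cycles;
            }
        }
    }
    if (best_ns == 0) return 0.0;
    // Cycles per nanosecond is GHz; multiply by 1000 to get MHz.
    return static_cast<double>(best_cycles) / static_cast<double>(best_ns) * 1000.0;
}
```

With one sample recorded per SpeedTest() call, a single bad reading only invalidates the pairs it takes part in, so the estimate can still span most of the run.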