rurban / smhasher

Hash function quality and speed tests
https://rurban.github.io/smhasher/

Report CPU frequency as measured by rdtsc() instead of 3 GHz #301

Open darkk opened 1 month ago

darkk commented 1 month ago

It makes timer accounting issues like #241 more visible.

The overall logic is to sample (wall-clock-ns, cycle-counter) pairs and take the longest possible valid interval out of 2,999 SpeedTest() calls during --test=SpeedBulk.

The code has to provide reasonable MHz estimates accounting for:

That's why the code samples the pairs on every SpeedTest() call and not just twice in some "good" places.
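
To make the sampling idea concrete, here is a minimal C++ sketch. It is not the code from this PR: TimerSample, estimate_mhz, and the rule for what counts as a "valid" interval (both clocks strictly advance between consecutive samples) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sample: one (wall-clock ns, cycle counter) pair,
// taken once per SpeedTest() call.
struct TimerSample {
    std::uint64_t wall_ns;
    std::uint64_t cycles;
};

// Returns the MHz estimate over the longest run of samples in which both
// clocks strictly advance, or 0.0 if no run spans at least two samples.
// The validity rule is an assumption, not the PR's actual criteria.
double estimate_mhz(const std::vector<TimerSample>& samples) {
    double best_mhz = 0.0;
    std::uint64_t best_ns = 0;
    std::size_t run_start = 0;
    for (std::size_t i = 1; i <= samples.size(); ++i) {
        const bool run_ends = i == samples.size() ||
                              samples[i].wall_ns <= samples[i - 1].wall_ns ||
                              samples[i].cycles <= samples[i - 1].cycles;
        if (!run_ends) continue;
        if (i - 1 > run_start) {
            const std::uint64_t dns =
                samples[i - 1].wall_ns - samples[run_start].wall_ns;
            const std::uint64_t dcycles =
                samples[i - 1].cycles - samples[run_start].cycles;
            assert(dns > 0);  // guaranteed by the strictly-advancing rule
            if (dns > best_ns) {
                best_ns = dns;
                // cycles per nanosecond is GHz; multiply by 1000 for MHz.
                best_mhz = static_cast<double>(dcycles) * 1000.0 /
                           static_cast<double>(dns);
            }
        }
        run_start = i;
    }
    return best_mhz;
}
```

Keeping every sample and choosing the interval afterwards means a single bad reading (for example, a clock that appears to step backwards) only splits the run instead of invalidating the whole estimate.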

rurban commented 1 month ago

Sounds good, I'll test. Thanks.

darkk commented 1 month ago

Thanks! I'm looking forward to your comments and/or approval to continue submitting a few more chunks from my patch stack :-)

rurban commented 1 month ago

It's also available as branch darkk-cpu-mhz for testing on more machines.

darkk commented 1 month ago

for testing on more machines

:+1:

Would it be useful if I upload results from my machines to this PR as well?

rurban commented 1 month ago

On a machine where /proc/cpuinfo reports 3400 MHz, it calculates 2712. I'd rather read /proc/cpuinfo first and only then fall back to this method.

And on a bigger machine of mine running at 3550 MHz (AMD Ryzen 9 7950X3D 16-Core) it prints "WARNING: timer resolution is 84 (0x54) ticks (0x362a3e8e0a729e - 0x362a3e8e0a724a). Broken VDSO?" and calculates 4200.
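
A minimal C++ sketch of the "read /proc/cpuinfo first, fall back to the measurement" order could look like the following. The function names parse_cpuinfo_mhz and report_mhz are hypothetical and not part of SMHasher, and the sketch assumes the Linux x86 layout where the line starts with "cpu MHz".

```cpp
#include <cassert>
#include <exception>
#include <fstream>
#include <sstream>
#include <string>

// Returns the first "cpu MHz" value found in cpuinfo-formatted text,
// or a negative number if no such line can be parsed.
double parse_cpuinfo_mhz(const std::string& cpuinfo_text) {
    std::istringstream in(cpuinfo_text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 7, "cpu MHz") != 0) continue;
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        try {
            return std::stod(line.substr(colon + 1));
        } catch (const std::exception&) {
            continue;  // malformed value, keep scanning
        }
    }
    return -1.0;
}

// measured_mhz is the hypothetical result of the rdtsc-based estimate;
// it is used only when /proc/cpuinfo is missing or has no usable line.
double report_mhz(double measured_mhz) {
    assert(measured_mhz >= 0.0);
    std::ifstream file("/proc/cpuinfo");
    if (file) {
        std::ostringstream text;
        text << file.rdbuf();
        const double from_cpuinfo = parse_cpuinfo_mhz(text.str());
        if (from_cpuinfo > 0.0) return from_cpuinfo;
    }
    return measured_mhz;
}
```

Note that "cpu MHz" can reflect the current scaled frequency rather than the nominal one: the Pentium M dump later in this thread shows 600.000 for a 1.50GHz part.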

darkk commented 1 month ago

However, do I understand correctly that it is exactly the same rdtsc() that is used to calculate cycles/hash, and that it has been a constant-rate counter since circa Pentium 4 and no longer depends on CPU frequency?

No, look at exactly how we perform the timer start and stop. It's better than a mere rdtsc.

darkk commented 1 month ago

It's better than a mere rdtsc

That's true, it implements barriers and that's important.
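
For readers unfamiliar with the barrier point, here is a generic C++ sketch of a fenced TSC read. It is not SMHasher's actual timer code: timer_start, timer_end, and elapsed_ticks are hypothetical names, the intrinsics assume GCC or Clang on x86-64 with RDTSCP available, and every other target falls back to std::chrono::steady_clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>
#define SKETCH_HAVE_TSC 1
#else
#define SKETCH_HAVE_TSC 0
#endif

// Portable stand-in used when the TSC intrinsics are not available.
static std::uint64_t fallback_ticks() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch())
            .count());
}

std::uint64_t timer_start() {
#if SKETCH_HAVE_TSC
    _mm_lfence();  // let earlier instructions finish before reading
    const std::uint64_t t = __rdtsc();
    _mm_lfence();  // keep the measured code from starting before the read
    return t;
#else
    return fallback_ticks();
#endif
}

std::uint64_t timer_end() {
#if SKETCH_HAVE_TSC
    unsigned aux = 0;
    // RDTSCP waits for earlier instructions to execute before reading.
    const std::uint64_t t = __rdtscp(&aux);
    _mm_lfence();  // keep later instructions from moving above the read
    return t;
#else
    return fallback_ticks();
#endif
}

// Assumes both reads came from counters that are in sync with each other.
std::uint64_t elapsed_ticks(std::uint64_t start, std::uint64_t end) {
    assert(end >= start);
    return end - start;
}
```

The fences only constrain instruction ordering around the read. They do not change which counter is read, so the tick rate is still whatever the TSC provides.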

However, as far as I can see, it still reads the same 64-bit MSR via RDTSC / RDTSCP, and that counter may tick at a constant rate, which seems to be exactly what the constant_tsc flag means.

I'm looking at the Intel 64 and IA-32 Architectures Software Developer's Manual, v083, section 19.17 TIME-STAMP COUNTER of volume 3A (page 3763 in the PDF). It states the following:

For Pentium 4 processors, Intel Xeon processors (…); for Intel Core Solo and Intel Core Duo processors (…); for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors (…); for Intel Core 2 and Intel Xeon processors (…); for Intel Atom processors (…): the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency, see Section 21.7.2 for more detail. On certain processors, the TSC frequency may not be the same as the frequency in the brand string. The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.

As far as I understand, that's quite different from a cycle counter coming from the performance registers. Am I getting it wrong and/or looking in the wrong place altogether?
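
One way to tell the two cases apart at run time on Linux is to look for constant_tsc in the flags line of /proc/cpuinfo. The C++ helper below is a hypothetical sketch (has_cpu_flag is not an SMHasher function) and assumes the x86 layout where that line starts with "flags".

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole word in the first "flags" line
// of cpuinfo-formatted text. Only the first CPU entry is inspected.
bool has_cpu_flag(const std::string& cpuinfo_text, const std::string& flag) {
    assert(!flag.empty());
    std::istringstream in(cpuinfo_text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "flags") != 0) continue;
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::string token;
        while (tokens >> token) {
            if (token == flag) return true;  // exact match, so "tsc" != "constant_tsc"
        }
        return false;
    }
    return false;
}
```

When has_cpu_flag(text, "constant_tsc") is true, a rate derived from the TSC describes the TSC itself and not necessarily the current core clock.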

darkk commented 4 weeks ago

I've tested the branch on Intel(R) Pentium(R) M processor 1.50GHz (family: 0x6, model: 0xd, stepping: 0x8) running Linux version 5.15.162 (builder@buildhost) (i486-openwrt-linux-musl-gcc (OpenWrt GCC 12.3.0 r24012-d8dd03c46f) 12.3.0, GNU ld (GNU Binutils) 2.40.0) #0 Mon Jul 15 22:14:18 2024.

It's an old Intel CPU from the era of ia-32-ia-64-benchmark-code-execution-paper.pdf. It has no constant_tsc feature, and the branch works correctly under that assumption. So it suggests that the suspicious-looking MHz value on a modern CPU is probably related to constant_tsc and is not a regression.

I've tuned /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq and scaling_max_freq and got the expected results:

scaling_*_freq, min/max (kHz)   Hash          MHz
600 000 / 1 500 000             wyhash32      991.38 MiB/sec @ 1493 MHz
600 000 / 1 500 000             donothing32   7455000.00 MiB/sec @ 1491 MHz
600 000 / 1 500 000             donothing64   7450000.00 MiB/sec @ 1490 MHz
600 000                         donothing64   2990000.00 MiB/sec @ 598 MHz
900 000                         donothing64   4490000.00 MiB/sec @ 898 MHz
1 100 000                       donothing32   5485000.00 MiB/sec @ 1097 MHz
1 100 000                       wyhash32      658.72 MiB/sec @ 1097 MHz
1 500 000                       donothing64   7480000.00 MiB/sec @ 1496 MHz
1 500 000                       sha2-256      47.49 MiB/sec @ 1496 MHz

I assume the frequency fluctuates a bit in the 600/1500 case because dynamic scaling kicks in at startup. For the other cases, the measurement is quite stable across runs.
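
As a quick sanity check on the pinned rows, the reported MHz can be compared with the configured frequency. The C++ helper below is a hypothetical sketch (deviation_percent is not part of the branch); it relies on the cpufreq sysfs values being in kHz.

```cpp
#include <cassert>
#include <cmath>

// configured_khz: value written to scaling_min_freq / scaling_max_freq (kHz).
// measured_mhz:   MHz figure printed next to the hash throughput.
// Returns the absolute deviation as a percentage of the configured value.
double deviation_percent(double configured_khz, double measured_mhz) {
    assert(configured_khz > 0.0);
    const double configured_mhz = configured_khz / 1000.0;
    return std::fabs(measured_mhz - configured_mhz) / configured_mhz * 100.0;
}
```

For the four pinned settings (600 000, 900 000, 1 100 000 and 1 500 000 kHz against 598, 898, 1097 and 1496 MHz) the deviation stays below 0.5%.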

dmidecode:

Processor Information
        Socket Designation: None
        Type: Central Processor
        Family: Pentium M
        Manufacturer: GenuineIntel
        ID: D8 06 00 00 FF FB E9 AF
        Signature: Type 0, Family 6, Model 13, Stepping 8
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (FXSAVE and FXSTOR instructions supported)
                SSE (Streaming SIMD extensions)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                TM (Thermal monitor supported)
                PBE (Pending break enabled)
        Version: Intel(R) Pentium(R) M processor
        Voltage: 1.1 V
        External Clock: 400 MHz
        Max Speed: 1500 MHz
        Current Speed: 1500 MHz
        Status: Populated, Enabled
        Upgrade: None
        L1 Cache Handle: 0x000A
        L2 Cache Handle: 0x000B
        L3 Cache Handle: Not Provided
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified

/proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.50GHz
stepping        : 8
microcode       : 0x20
cpu MHz         : 600.000
cache size      : 2048 KB
fdiv_bug        : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx bts cpuid est tm2
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips        : 1197.13
clflush size    : 64
cache_alignment : 64
address sizes   : 32 bits physical, 32 bits virtual
power management:

@rurban, please advise: how should I proceed with this branch?

I'm thinking of integrating djb's libcpucycles into SMhasher to get more accurate CPU cycle measurements on modern CPUs, but it'll make the diff somewhat larger and harder to review.

rurban commented 4 weeks ago

As I said: read from /proc/cpuinfo first, with your method as the fallback. Or use djb's -lcpucycles, which also looks good.