bradfitz opened this issue 6 years ago
Again, on linux/amd64 I see the following results for 1.10, 1.11rc2, and tip.
name old time/op new time/op delta
RC4_128-4 262ns ± 1% 204ns ± 0% -22.48% (p=0.029 n=4+4)
RC4_1K-4 2.46µs ± 0% 1.62µs ± 0% -34.16% (p=0.029 n=4+4)
RC4_8K-4 19.7µs ± 0% 12.7µs ± 0% -35.71% (p=0.029 n=4+4)
name old speed new speed delta
RC4_128-4 487MB/s ± 1% 627MB/s ± 0% +28.71% (p=0.029 n=4+4)
RC4_1K-4 416MB/s ± 0% 631MB/s ± 0% +51.91% (p=0.029 n=4+4)
RC4_8K-4 410MB/s ± 0% 638MB/s ± 0% +55.55% (p=0.029 n=4+4)
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
stepping : 3
microcode : 0xc2
cpu MHz : 546.356
cache size : 4096 KB
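For context, these time/op and speed numbers come from running Go's RC4 benchmarks under benchstat. A minimal sketch of an equivalent benchmark, assuming an illustrative helper name and key (not copied from crypto/rc4's rc4_test.go):

```go
package rc4bench

import (
	"crypto/rc4"
	"testing"
)

// benchmarkRC4 measures XORKeyStream throughput for one buffer size.
// Sketch only: names, key, and the 8K size are illustrative.
func benchmarkRC4(b *testing.B, size int64) {
	c, err := rc4.NewCipher([]byte{1, 2, 3, 4, 5, 6, 7, 8})
	if err != nil {
		b.Fatal(err)
	}
	buf := make([]byte, size)
	b.SetBytes(size) // makes the framework report MB/s, as in the tables above
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c.XORKeyStream(buf, buf)
	}
}

func BenchmarkRC4_128(b *testing.B) { benchmarkRC4(b, 128) }
func BenchmarkRC4_1K(b *testing.B)  { benchmarkRC4(b, 1024) }
func BenchmarkRC4_8K(b *testing.B)  { benchmarkRC4(b, 8192) }
```

Running `go test -bench=RC4 -count=10` before and after a change and feeding the two outputs to benchstat produces tables like the ones above.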
Taking the MacBook Pro (Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz) stats of CL 102255 into account, it seems like performance is reduced only on server chips (Xeon)... Further, the assembly code the compiler generates for the generic implementation does not use any SIMD (SSE/SSE2) instructions, while the manually written amd64 code does...
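For reference, here is a minimal sketch of the generic RC4 inner loop, close to (but simplified from) what crypto/rc4's pure Go implementation does. Each iteration's S-box swap feeds the next iteration's lookups, so the loop is one long serial dependency chain that the compiler has no easy way to vectorize:

```go
// s is the 256-entry RC4 permutation; i, j are the cipher's state indices.
// Sketch only: the real crypto/rc4 code also validates arguments and
// stores i, j back into the Cipher struct.
func xorKeyStream(s *[256]uint32, i, j uint8, dst, src []byte) (uint8, uint8) {
	for k, v := range src {
		i++
		x := s[i]
		j += uint8(x)
		y := s[j]
		s[i], s[j] = y, x // this swap is read by the very next iteration
		dst[k] = v ^ uint8(s[uint8(x+y)])
	}
	return i, j
}
```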
It was suggested to me in private that the assembly might be penalized by a slowdown in accessing xmm registers.
"the latency of moving things between general-purpose and xmm registers seems to have increased with Skylake, so those pinsrw instructions are probably more costly, and they are in the critical path"
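For illustration, this is the kind of general-purpose-to-xmm insert in question, in Go assembler syntax (the exact line is illustrative, not quoted from the deleted rc4_amd64.s):

```
// Insert the low 16 bits of AX into word 0 of X0. On Skylake, such
// GP-to-XMM transfers carry higher latency than on earlier cores.
PINSRW $0, AX, X0
```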
Also see the replies to https://twitter.com/FiloSottile/status/1031994832460210176
However, I still find it confusing that the compiled code, which should not be doing anything special, is faster on the 1.2GHz CPU than on the 2.3GHz one.
@TocarIP @Quasilyte
Just to add another data point... I tried this on my work machine and got similar numbers to Brad's.
name old time/op new time/op delta
RC4_128-8 183ns ± 1% 335ns ± 1% +82.55% (p=0.000 n=16+19)
RC4_1K-8 1.37µs ± 4% 2.65µs ± 1% +94.01% (p=0.000 n=20+19)
RC4_8K-8 10.6µs ± 1% 20.9µs ± 1% +97.31% (p=0.000 n=18+19)
name old speed new speed delta
RC4_128-8 696MB/s ± 2% 382MB/s ± 1% -45.09% (p=0.000 n=17+19)
RC4_1K-8 748MB/s ± 4% 386MB/s ± 1% -48.45% (p=0.000 n=20+19)
RC4_8K-8 766MB/s ± 1% 388MB/s ± 1% -49.32% (p=0.000 n=18+19)
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
stepping : 2
microcode : 54
cpu MHz : 1200.000
cache size : 10240 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm cqm_llc cqm_occup_llc
Looks like there are at least two issues:
1) The asm version is slower on Skylake than on Haswell and has lower IPC (instructions per cycle), 2.14 vs 1.47.
2) More importantly, the pure Go version has significantly higher IPC on Skylake (and is faster), 3.37 vs 1.84.
I think I have a hypothesis explaining why the code runs so much faster on Skylake. Perf shows 1,466,664,692 resource_stalls.rs events, which means that for 1.4×10^9 cycles the reservation station couldn't accept uops. Skylake has 50% more reservation station capacity, which should allow more iterations to be interleaved.
However, I don't think this explanation provides actionable advice (having fewer instructions and shorter dependency chains is already a goal).
Punting to Unplanned; too late for anything major in 1.12.
Tracking bug so somebody (@randall77?) looks into why we got totally opposite performance numbers on different Intel CPUs when we deleted the crypto/rc4 code's assembly in "favor" of the Go version in 30eda6715c6578de2086f03df36c4a8def838ec2 (https://golang.org/cl/130397).
Super mysterious, so we might want to understand it enough to decide whether we care and whether there's something the compiler might do better for more CPUs.
Maybe the benchmarks or benchstat are wrong? But then that'd be its own interesting bug.
/cc @josharian @aead @FiloSottile