cmd/compile: understand Go vs assembly rc4 results

bradfitz commented 6 years ago

Tracking bug so somebody (@randall77?) looks into why we got totally opposite performance numbers on different Intel CPUs when we deleted the crypto/rc4 code's assembly in "favor" of the Go version in 30eda6715c6578de2086f03df36c4a8def838ec2 (https://golang.org/cl/130397):

name       old time/op   new time/op   delta
RC4_128-4    256ns ± 0%    196ns ± 0%  -23.78%  (p=0.029 n=4+4)
RC4_1K-4    2.38µs ± 0%   1.54µs ± 0%  -35.22%  (p=0.029 n=4+4)
RC4_8K-4    19.4µs ± 1%   12.0µs ± 0%  -38.35%  (p=0.029 n=4+4)

name       old speed     new speed     delta
RC4_128-4  498MB/s ± 0%  654MB/s ± 0%  +31.12%  (p=0.029 n=4+4)
RC4_1K-4   431MB/s ± 0%  665MB/s ± 0%  +54.34%  (p=0.029 n=4+4)
RC4_8K-4   418MB/s ± 1%  677MB/s ± 0%  +62.18%  (p=0.029 n=4+4)

vendor_id   : GenuineIntel
cpu family  : 6
model       : 142
model name  : Intel(R) Core(TM) i5-7Y54 CPU @ 1.20GHz
stepping    : 9
microcode   : 0x84
cpu MHz     : 800.036
cache size  : 4096 KB

name       old time/op   new time/op   delta
RC4_128-4    235ns ± 1%    431ns ± 0%  +83.00%  (p=0.000 n=10+10)
RC4_1K-4    1.74µs ± 0%   3.41µs ± 0%  +96.74%  (p=0.000 n=10+10)
RC4_8K-4    13.6µs ± 1%   26.8µs ± 0%  +97.58%   (p=0.000 n=10+9)

name       old speed     new speed     delta
RC4_128-4  543MB/s ± 0%  297MB/s ± 1%  -45.29%  (p=0.000 n=10+10)
RC4_1K-4   590MB/s ± 0%  300MB/s ± 0%  -49.16%  (p=0.000 n=10+10)
RC4_8K-4   596MB/s ± 1%  302MB/s ± 0%  -49.39%   (p=0.000 n=10+9)

vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping        : 0
microcode       : 0x1
cpu MHz         : 2300.000
cache size      : 46080 KB

Super mysterious, so we might want to understand it enough to decide whether we care and whether there's something the compiler might do better for more CPUs.

Maybe the benchmarks or benchstat are wrong? But then that'd be its own interesting bug.

/cc @josharian @aead @FiloSottile

aead commented 6 years ago

Again on linux/amd64 I see the following result for 1.10, 1.11rc2 and tip.

name       old time/op   new time/op   delta
RC4_128-4    262ns ± 1%    204ns ± 0%  -22.48%  (p=0.029 n=4+4)
RC4_1K-4    2.46µs ± 0%   1.62µs ± 0%  -34.16%  (p=0.029 n=4+4)
RC4_8K-4    19.7µs ± 0%   12.7µs ± 0%  -35.71%  (p=0.029 n=4+4)

name       old speed     new speed     delta
RC4_128-4  487MB/s ± 1%  627MB/s ± 0%  +28.71%  (p=0.029 n=4+4)
RC4_1K-4   416MB/s ± 0%  631MB/s ± 0%  +51.91%  (p=0.029 n=4+4)
RC4_8K-4   410MB/s ± 0%  638MB/s ± 0%  +55.55%  (p=0.029 n=4+4)

vendor_id   : GenuineIntel
cpu family  : 6
model       : 78
model name  : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
stepping    : 3
microcode   : 0xc2
cpu MHz     : 546.356
cache size  : 4096 KB

Taking the the Macbook Pro (Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz stats of CL-102255 into account it seems like the performance is reduced only on server chips (Xeon)... Further the assembler code generated by the compiler for the generic implementation does not use any SIMD (SSE/SSE2) instructions while the manually written amd64 code does...

FiloSottile commented 6 years ago

It was suggested to me in private that the assembly might be penalized by a slowdown in accessing xmm registers.

the latency of moving things between general-purpose and xmm registers seems to have increased with Skylake, so those pinsrw instructions are probably more costly, and they are in the critical path

Also see the replies to https://twitter.com/FiloSottile/status/1031994832460210176

However, I still find confusing that the compiled code, which should not be doing anything special, is faster on the 1.2GHz CPU than it is on the 2.3GHz one.

uluyol commented 6 years ago

The 1.2 GHz processor can turbo to 3.2 GHz.

agnivade commented 6 years ago

@TocarIP @Quasilyte

magical commented 6 years ago

Just to add another data point... I tried this on my work machine and got similar numbers to Brad's.

name       old time/op   new time/op   delta
RC4_128-8    183ns ± 1%    335ns ± 1%  +82.55%  (p=0.000 n=16+19)
RC4_1K-8    1.37µs ± 4%   2.65µs ± 1%  +94.01%  (p=0.000 n=20+19)
RC4_8K-8    10.6µs ± 1%   20.9µs ± 1%  +97.31%  (p=0.000 n=18+19)

name       old speed     new speed     delta
RC4_128-8  696MB/s ± 2%  382MB/s ± 1%  -45.09%  (p=0.000 n=17+19)
RC4_1K-8   748MB/s ± 4%  386MB/s ± 1%  -48.45%  (p=0.000 n=20+19)
RC4_8K-8   766MB/s ± 1%  388MB/s ± 1%  -49.32%  (p=0.000 n=18+19)

vendor_id   : GenuineIntel
cpu family  : 6
model       : 63
model name  : Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
stepping    : 2
microcode   : 54
cpu MHz     : 1200.000
cache size  : 10240 KB
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm cqm_llc cqm_occup_llc

TocarIP commented 6 years ago

Looks like there are at least 2 issues: 1) Asm version is slower on skylake than haswell and has lower IPC (instructions per cycle) 2.14 vs 1.47 2) More importantly pure go version has significantly higher IPC on skylake (and is faster) 3.37 vs 1.84

TocarIP commented 6 years ago

I think, I have a hypothesis, explaining code running much faster on skylake. Perf shows 1,466,664,692 resource_stalls.rs events, which means that for 1.4 *10^9 cycles reservation station couldn't accept uops. Skylake has 50% higher reservation station capacity, which should allow to interleave more iterations.

However I don't think that this explanation provides actionable advise (having less instruction and shorter dependency chains is already a goal)

randall77 commented 5 years ago

Punting to unplanned, too late for anything major in 1.12.

golang / go

cmd/compile: understand Go vs assembly rc4 results #27184