dsnet opened this issue 2 years ago
I don't know whether this issue is just with the Ryzen 5900X or all AMD chips.
I cannot reproduce on a Ryzen 3600 CPU (Zen 2) with 3200 MHz CL16 RAM:

```
benchmark            old ns/op  new ns/op  delta
BenchmarkSame-12     19689      20289      +3.05%
BenchmarkOrdered-12  32540      27448      -15.65%
BenchmarkRandom-12   139527     93838      -32.75%
```
It would be nice to confirm with another Zen 3 CPU to know whether this is a problem for the whole generation.
More Zen 2 results (that can't reproduce the problem):
```
goos: linux
goarch: amd64
cpu: AMD Ryzen 9 3950X 16-Core Processor
name        old time/op  new time/op  delta
Same-32     19.9µs ± 4%  19.9µs ± 2%  ~        (p=1.000 n=10+10)
Ordered-32  34.1µs ± 4%  27.1µs ± 3%  -20.68%  (p=0.000 n=10+10)
Random-32   127µs ± 1%   92µs ± 1%    -27.19%  (p=0.000 n=10+10)
```
More for AMD Ryzen 7 PRO 4750U
AMD Ryzen 7 5800X
```
goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5800X 8-Core Processor
name        old time/op  new time/op  delta
Same-16     17.2µs ± 0%  15.1µs ± 1%  -12.53%
Ordered-16  25.9µs ± 3%  73.2µs ± 1%  +182.54%
Random-16   35.9µs ± 1%  71.0µs ± 2%  +97.93%
```
Thanks for the results so far. Here's a summary for AMD chips:

| CPU | MicroArch | Same Delta | Ordered Delta | Random Delta | Source |
|---|---|---|---|---|---|
| Ryzen 5 3400G | Zen 2 | +4% | ~ | -24% | from @soypat |
| Ryzen 5 3600 | Zen 2 | +3% | -16% | -33% | from @Jorropo |
| Ryzen 7 PRO 4750U | Zen 2 | ~ | -21% | -26% | from @florianl |
| Ryzen 9 3900X | Zen 2 | -11% | -15% | -32% | from @inkeliz |
| Ryzen 9 3950X | Zen 2 | ~ | -21% | -27% | from @dominikh |
| EPYC 7502P | Zen 2 | -22% | -3% | -37% | from @inkeliz |
| Ryzen 7 5800X | Zen 3 | -13% | +183% | +98% | from @hhoughgg |
| Ryzen 7 PRO 5850U | Zen 3 | ~ | +182% | +109% | from @mvdan |
| Ryzen 9 5900X | Zen 3 | +217% | +227% | +88% | from @dsnet |
AMD Ryzen 3900X (Windows, 64 GB 2666 MHz ECC):

AMD EPYC 7502P (Debian, 512 GB 2933 MHz ECC):

Zen 2 arch: AMD Ryzen 5 3400G with Radeon Vega Graphics
I can confirm that mobile Zen 3 is also affected: an AMD Ryzen 7 PRO 5850U with Radeon Graphics on linux/amd64, idle and plugged in so it benchmarks at ~25W, showing the parent commit versus the change:
```
name        old time/op  new time/op  delta
Same-16     18.3µs ± 2%  18.2µs ± 1%  ~         (p=0.394 n=6+6)
Ordered-16  27.6µs ± 2%  77.8µs ± 1%  +181.89%  (p=0.002 n=6+6)
Random-16   37.4µs ± 0%  78.1µs ± 1%  +108.62%  (p=0.010 n=4+6)
```
Possibly this effect could be related to similar problems we saw with CMOV generation. Branches, if you occasionally guess right, can let loads happen in parallel that would otherwise be serialized. See the discussion starting at https://github.com/golang/go/issues/26306#issuecomment-405055313 I'm not convinced it is the same effect, but maybe. It would be worth examining the benchmark to see if that effect is possible, and if so maybe trying prefetches to see if that reduces the effect.
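That intuition can be made concrete with a small sketch (all names here are invented for illustration, not from the compiler): binary search lowers a switch to a tree of conditional branches, where a correct guess lets later work proceed speculatively, while a jump table lowers it to a single indirect jump, modeled below as an indirect call whose target changes with the input.

```go
package main

import "fmt"

// Jump-table style: one indirect dispatch through a table indexed by k.
// An indirect jump with a constantly changing target is hard to predict.
var table = [4]func() string{
	func() string { return "Invalid" },
	func() string { return "Bool" },
	func() string { return "Int" },
	func() string { return "String" },
}

func viaJumpTable(k int) string { return table[k]() }

// Binary-search style: a tree of conditional branches, mirroring what a
// compiler might emit for 4 cases. Each branch can be guessed, sometimes
// correctly, letting dependent loads begin early.
func viaBinarySearch(k int) string {
	if k < 2 {
		if k == 0 {
			return "Invalid"
		}
		return "Bool"
	}
	if k == 2 {
		return "Int"
	}
	return "String"
}

func main() {
	for k := 0; k < 4; k++ {
		fmt.Println(viaJumpTable(k), viaBinarySearch(k))
	}
}
```

Both strategies compute the same result; the difference is purely in how the branch-prediction hardware handles them, which is why the outcome can vary so much across microarchitectures.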
Also possibly the loss in performance has something to do with Spectre mitigation?
Worth noting that I got my slow-down numbers above on Linux with `mitigations=off`, but I do use up-to-date BIOS firmware and AMD microcode. I'm not sure which of those you mean.
More than happy to help benchmark attempts at solving the regression, as I imagine few in the compiler team have zen3 machines.
It's possible that this report on reddit stems from this issue.
Possibly unrelated: the OP reported it's more likely a NUMA issue, as disabling the second CCD makes it go faster: https://www.reddit.com/r/golang/comments/y4w74o/comment/isjxqby/
Zen 4 is also affected. I used Go 1.18.10 and 1.21.3.
On my AMD Ryzen 5900x, I noticed a drop in performance after jump tables were added (1ba96d8c0909eca59e28c048150c3834982f79fb).
The CL warns that microbenchmarks will favor the older binary-search approach, while jump tables will be better utilized in larger benchmarks since they use fewer hardware resources (branch predictors, etc.). To measure this, I tried to craft a micro-benchmark that would exercise edge cases of either approach. This is what I came up with:
Essentially, `KindString` is just like `reflect.Kind.String` but uses a giant switch statement. The `Same`, `Ordered`, and `Random` benchmarks keep calling `KindString` with either the same argument, sequentially ordered arguments (i.e., easily predictable), or completely random arguments. I would expect jump tables to perform better on `Random` than the binary-search approach, since binary search would continually incur log2(27) layers of branch-prediction failures.

I ran the benchmarks using a Go toolchain built at 1ba96d8c0909eca59e28c048150c3834982f79fb and also at the commit right before it.
On at least 3 CPUs that I have access to, I see promising results:
Here, jump tables do better than binary search on all cases. Great!
Here, jump tables do worse on `Ordered`, but have much better performance on `Random`. Somewhat expected.

Here, this is a Digital Ocean droplet, which is probably an Intel Xeon processor of some kind. Again, we see jump tables do better than binary search in all cases. Great again!
However, things turn for the worse on an AMD Ryzen 5900X:
Here, the jump tables are dramatically worse in all cases. I'm surprised that it's worse by 2x on the `Random` benchmark, which is where I expected jump tables to outshine binary search.

I understand that performance improvements are really hard to quantify, since what works well for one CPU design (e.g., Intel) may not be optimal for another (e.g., AMD). However, these results are much more drastic than I was expecting. I don't know whether this issue is just with the Ryzen 5900X or all AMD chips. I also don't know whether this will result in any changes to the implementation of jump tables, but I think this is a surprising result worth reporting.
\cc @randall77 @cherrymui