dsnet opened this issue 2 years ago
I don't know whether this issue is just with the Ryzen 5900X or all AMD chips.
I cannot reproduce on a Ryzen 3600 CPU (Zen 2) with 3200 MHz CL16 RAM:

```
benchmark            old ns/op  new ns/op  delta
BenchmarkSame-12     19689      20289      +3.05%
BenchmarkOrdered-12  32540      27448      -15.65%
BenchmarkRandom-12   139527     93838      -32.75%
```
It would be nice to confirm with another Zen 3 CPU to know whether this is a problem for the whole generation.
More Zen 2 results (that can't reproduce the problem):
```
goos: linux
goarch: amd64
cpu: AMD Ryzen 9 3950X 16-Core Processor
name        old time/op  new time/op  delta
Same-32     19.9µs ± 4%  19.9µs ± 2%  ~        (p=1.000 n=10+10)
Ordered-32  34.1µs ± 4%  27.1µs ± 3%  -20.68%  (p=0.000 n=10+10)
Random-32   127µs ± 1%   92µs ± 1%    -27.19%  (p=0.000 n=10+10)
```
More for AMD Ryzen 7 PRO 4750U
AMD Ryzen 7 5800X
```
goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5800X 8-Core Processor
name        old time/op  new time/op  delta
Same-16     17.2µs ± 0%  15.1µs ± 1%  -12.53%
Ordered-16  25.9µs ± 3%  73.2µs ± 1%  +182.54%
Random-16   35.9µs ± 1%  71.0µs ± 2%  +97.93%
```
Thanks for the results so far. Here's a summary for AMD chips:

| CPU | MicroArch | Same Delta | Ordered Delta | Random Delta | Source |
|---|---|---|---|---|---|
| Ryzen 5 3400G | Zen 2 | +4% | ~ | -24% | from @soypat |
| Ryzen 5 3600 | Zen 2 | +3% | -16% | -33% | from @Jorropo |
| Ryzen 7 PRO 4750U | Zen 2 | ~ | -21% | -26% | from @florianl |
| Ryzen 9 3900X | Zen 2 | -11% | -15% | -32% | from @inkeliz |
| Ryzen 9 3950X | Zen 2 | ~ | -21% | -27% | from @dominikh |
| EPYC 7502P | Zen 2 | -22% | -3% | -37% | from @inkeliz |
| Ryzen 7 5800X | Zen 3 | -13% | +183% | +98% | from @hhoughgg |
| Ryzen 7 PRO 5850U | Zen 3 | ~ | +182% | +109% | from @mvdan |
| Ryzen 9 5900X | Zen 3 | +217% | +227% | +88% | from @dsnet |
AMD Ryzen 3900X (Windows, 64 GB 2666 MHz ECC):

AMD EPYC 7502P (Debian, 512 GB 2933 MHz ECC):

Zen 2 arch: AMD Ryzen 5 3400G with Radeon Vega Graphics
I can confirm that mobile Zen 3 is also affected: an AMD Ryzen 7 PRO 5850U with Radeon Graphics on linux/amd64, idle and plugged in so it benchmarks at ~25W, showing the parent commit versus the change:
```
name        old time/op  new time/op  delta
Same-16     18.3µs ± 2%  18.2µs ± 1%  ~         (p=0.394 n=6+6)
Ordered-16  27.6µs ± 2%  77.8µs ± 1%  +181.89%  (p=0.002 n=6+6)
Random-16   37.4µs ± 0%  78.1µs ± 1%  +108.62%  (p=0.010 n=4+6)
```
Possibly this effect could be related to similar problems we saw with CMOV generation. Branches, if you occasionally guess right, can let loads happen in parallel that would otherwise be serialized. See the discussion starting at https://github.com/golang/go/issues/26306#issuecomment-405055313 I'm not convinced it is the same effect, but maybe. It would be worth examining the benchmark to see if that effect is possible, and if so maybe trying prefetches to see if that reduces the effect.
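That intuition can be made concrete with a small sketch (all names here are invented for illustration, not from the compiler): binary search lowers a switch to a tree of conditional branches, where a correct guess lets later work proceed speculatively, while a jump table lowers it to a single indirect jump, modeled below as an indirect call whose target changes with the input.

```go
package main

import "fmt"

// Jump-table style: one indirect dispatch through a table indexed by k.
// An indirect jump with a constantly changing target is hard to predict.
var table = [4]func() string{
	func() string { return "Invalid" },
	func() string { return "Bool" },
	func() string { return "Int" },
	func() string { return "String" },
}

func viaJumpTable(k int) string { return table[k]() }

// Binary-search style: a tree of conditional branches, mirroring what a
// compiler might emit for 4 cases. Each branch can be guessed, sometimes
// correctly, letting dependent loads begin early.
func viaBinarySearch(k int) string {
	if k < 2 {
		if k == 0 {
			return "Invalid"
		}
		return "Bool"
	}
	if k == 2 {
		return "Int"
	}
	return "String"
}

func main() {
	for k := 0; k < 4; k++ {
		fmt.Println(viaJumpTable(k), viaBinarySearch(k))
	}
}
```

Both strategies compute the same result; the difference is purely in how the branch-prediction hardware handles them, which is why the outcome can vary so much across microarchitectures.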
Also possibly the loss in performance has something to do with Spectre mitigation?
Worth noting that I got my slow-down numbers above on Linux with `mitigations=off`, but I do use up-to-date BIOS firmware and AMD microcode. I'm not sure which of those you mean.
More than happy to help benchmark attempts at solving the regression, as I imagine few in the compiler team have zen3 machines.
It's possible that this report on reddit stems from this issue.
Possibly unrelated: the OP reported it's more likely a NUMA issue, as disabling the second CCD makes it go faster: https://www.reddit.com/r/golang/comments/y4w74o/comment/isjxqby/
Zen 4 is also affected. I used Go 1.18.10 and 1.21.3.
On my AMD Ryzen 5900x, I noticed a drop in performance after jump tables were added (1ba96d8c0909eca59e28c048150c3834982f79fb).
The CL warns that microbenchmarks will favor the older binary-search approach, while jump tables will be better utilized in larger benchmarks since they use fewer hardware resources (branch predictors, etc.). To measure this, I tried to craft a micro-benchmark that would exercise edge cases of either approach. This is what I came up with:
Essentially, `KindString` is just like `reflect.Kind.String` but uses a giant switch statement. The `Same`, `Ordered`, and `Random` benchmarks keep calling `KindString` with either the same argument, sequentially ordered arguments (i.e., easily predictable), or completely random arguments. I would expect jump tables to perform better on `Random` than the binary-search approach, since binary search would continually incur log2(27) layers of branch-prediction failures.

I ran the benchmarks using a Go toolchain built at 1ba96d8c0909eca59e28c048150c3834982f79fb and also at the commit right before it.
On at least 3 CPUs that I have access to, I see promising results:
Here, jump tables do better than binary search on all cases. Great!
Here, jump tables do worse on `Ordered`, but have much better performance on `Random`. Somewhat expected.

Here, this is a Digital Ocean droplet, which is probably an Intel Xeon processor of some kind. Again, we see jump tables do better than binary search in all cases. Great again!
However, things turn for the worse on an AMD Ryzen 5900X:
Here, the jump tables are dramatically worse in all cases. I'm surprised that it's worse by 2x on the `Random` benchmark, which is where I expected jump tables to outshine binary search.

I understand that performance improvements are really hard to quantify, since what works well for one CPU design (e.g., Intel) may not be optimal for another (e.g., AMD). However, these results are much more drastic than I was expecting. I don't know whether this issue is just with the Ryzen 5900X or all AMD chips. I also don't know whether this will result in any changes to the implementation of jump tables, but I think this is a surprising result worth reporting.
\cc @randall77 @cherrymui