cmd/compile: large type switches that always match the first value are slower in Go 1.22

dominiquelefevre commented 2 months ago

Go version

go version go1.22.0 linux/amd64

Output of `go env` in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/dominique/.cache/go-build'
GOENV='/home/dominique/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/dominique/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/dominique/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.0'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1026376215=/tmp/go-build -gno-record-gcc-switches'

What did you do?

The benchmark BenchmarkDeepValueSameGoRoutine in package context became 2x slower in go 1.22.

Run the following command:

go test -v -run=BenchmarkDeepValueSameGoRoutine -bench=BenchmarkDeepValueSameGoRoutine

What did you see happen?

With go 1.21.0, I get the following:

goos: linux
goarch: amd64
pkg: context
cpu: AMD Ryzen 9 3950X 16-Core Processor
BenchmarkDeepValueSameGoRoutine
BenchmarkDeepValueSameGoRoutine/depth=10
BenchmarkDeepValueSameGoRoutine/depth=10-16             35302640                33.76 ns/op
BenchmarkDeepValueSameGoRoutine/depth=20
BenchmarkDeepValueSameGoRoutine/depth=20-16             15910783                69.36 ns/op
BenchmarkDeepValueSameGoRoutine/depth=30
BenchmarkDeepValueSameGoRoutine/depth=30-16             11908124                98.36 ns/op
BenchmarkDeepValueSameGoRoutine/depth=50
BenchmarkDeepValueSameGoRoutine/depth=50-16              7204759               164.2 ns/op
BenchmarkDeepValueSameGoRoutine/depth=100
BenchmarkDeepValueSameGoRoutine/depth=100-16             3947766               298.7 ns/op
PASS
ok      context 6.532s

With go 1.22.0 I get this:

goos: linux
goarch: amd64
pkg: context
cpu: AMD Ryzen 9 3950X 16-Core Processor
BenchmarkDeepValueSameGoRoutine
BenchmarkDeepValueSameGoRoutine/depth=10
BenchmarkDeepValueSameGoRoutine/depth=10-16             19228312                62.40 ns/op
BenchmarkDeepValueSameGoRoutine/depth=20
BenchmarkDeepValueSameGoRoutine/depth=20-16             11805692               102.0 ns/op
BenchmarkDeepValueSameGoRoutine/depth=30
BenchmarkDeepValueSameGoRoutine/depth=30-16              8119362               141.5 ns/op
BenchmarkDeepValueSameGoRoutine/depth=50
BenchmarkDeepValueSameGoRoutine/depth=50-16              5516227               216.1 ns/op
BenchmarkDeepValueSameGoRoutine/depth=100
BenchmarkDeepValueSameGoRoutine/depth=100-16             2936217               412.7 ns/op
PASS
ok      context 6.924s

What did you expect to see?

No performance degradation.

gabyhelp commented 2 months ago

Similar Issues

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

dominiquelefevre commented 2 months ago

I've bisected this degradation to two commits. The first is certainly related to type switches. The second one seems wild, although I have a theory how it can influence things.

The first commit is

commit 28f4ea16a240af6c5a417e20be77745329f817f1
Author: Keith Randall <khr@golang.org>
Date:   Fri Sep 1 20:32:07 2023 -0700

    cmd/compile: improve interface type switches

    For type switches where the targets are interface types,
    call into the runtime once instead of doing a sequence
    of assert* calls.

    name                                 old time/op  new time/op  delta
    SwitchInterfaceTypePredictable-24    26.6ns ± 1%  25.8ns ± 2%  -2.86%  (p=0.000 n=10+10)
    SwitchInterfaceTypeUnpredictable-24  39.3ns ± 1%  37.5ns ± 2%  -4.57%  (p=0.000 n=10+10)

    Not super helpful by itself, but this code organization allows
    followon CLs that add caching to the lookups.

    Change-Id: I7967f85a99171faa6c2550690e311bea8b54b01c
    Reviewed-on: https://go-review.googlesource.com/c/go/+/526657
    Reviewed-by: Matthew Dempsky <mdempsky@google.com>
    LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
    Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
    Reviewed-by: Keith Randall <khr@google.com>

 src/cmd/compile/internal/ir/node.go                |  23 +-
 src/cmd/compile/internal/ir/node_gen.go            |  46 +++
 src/cmd/compile/internal/ir/op_string.go           |  21 +-
 src/cmd/compile/internal/ir/stmt.go                |  37 ++
 src/cmd/compile/internal/ir/symtab.go              |   1 +
 src/cmd/compile/internal/ssagen/ssa.go             |  10 +
 src/cmd/compile/internal/test/switch_test.go       | 117 ++++++
 .../compile/internal/typecheck/_builtin/runtime.go |   3 +
 src/cmd/compile/internal/typecheck/builtin.go      | 398 +++++++++++----------
 src/cmd/compile/internal/walk/stmt.go              |   1 +
 src/cmd/compile/internal/walk/switch.go            | 390 +++++++++++++-------
 src/internal/abi/switch.go                         |  13 +
 src/runtime/iface.go                               |  16 +
 test/codegen/switch.go                             |  21 ++
 14 files changed, 742 insertions(+), 355 deletions(-)
 create mode 100644 src/internal/abi/switch.go

The benchmark results at commit 28f4ea1:

goos: linux
goarch: amd64
pkg: context
cpu: AMD Ryzen 9 3950X 16-Core Processor
BenchmarkDeepValueSameGoRoutine
BenchmarkDeepValueSameGoRoutine/depth=10
BenchmarkDeepValueSameGoRoutine/depth=10-16             17287828                66.11 ns/op
BenchmarkDeepValueSameGoRoutine/depth=20
BenchmarkDeepValueSameGoRoutine/depth=20-16             10979985               115.8 ns/op
BenchmarkDeepValueSameGoRoutine/depth=30
BenchmarkDeepValueSameGoRoutine/depth=30-16              7719157               178.2 ns/op
BenchmarkDeepValueSameGoRoutine/depth=50
BenchmarkDeepValueSameGoRoutine/depth=50-16              4705028               249.5 ns/op
BenchmarkDeepValueSameGoRoutine/depth=100
BenchmarkDeepValueSameGoRoutine/depth=100-16             2414732               498.4 ns/op
PASS
ok      context 7.277s

The benchmark results at commit 10da3b6 which immediatly precedes 28f4ea1:

goos: linux
goarch: amd64
pkg: context
cpu: AMD Ryzen 9 3950X 16-Core Processor
BenchmarkDeepValueSameGoRoutine
BenchmarkDeepValueSameGoRoutine/depth=10
BenchmarkDeepValueSameGoRoutine/depth=10-16             26770080                40.36 ns/op
BenchmarkDeepValueSameGoRoutine/depth=20
BenchmarkDeepValueSameGoRoutine/depth=20-16             13413070                88.30 ns/op
BenchmarkDeepValueSameGoRoutine/depth=30
BenchmarkDeepValueSameGoRoutine/depth=30-16             10121890               131.7 ns/op
BenchmarkDeepValueSameGoRoutine/depth=50
BenchmarkDeepValueSameGoRoutine/depth=50-16              6123589               216.5 ns/op
BenchmarkDeepValueSameGoRoutine/depth=100
BenchmarkDeepValueSameGoRoutine/depth=100-16             3088478               415.7 ns/op
PASS
ok      context 7.071s

dominiquelefevre commented 2 months ago

The second commit surprised me, but it also degrades the benchmark noticeably:

commit fa4f951026f697bc042422d95a0806dcbab7ddd0
Author: Jes Cok <xigua67damn@gmail.com>
Date:   Fri Sep 29 06:06:40 2023 +0000

    syscall: make origRlimitNofile atomic.Pointer[Rlimit]

    Currently we are bootstrapping with Go 1.20, origRlimitNofile can
    be changed to atomic.Pointer[Rlimit].

    Change-Id: I00ce9d1a9030bd5dbd34e3dc6c4e38683a87be86
    GitHub-Last-Rev: f2ccdb38412019d10661ed6be42086b445e411bf
    GitHub-Pull-Request: golang/go#63274
    Reviewed-on: https://go-review.googlesource.com/c/go/+/531516
    LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
    Reviewed-by: qiulaidongfeng <2645477756@qq.com>
    Auto-Submit: Ian Lance Taylor <iant@google.com>
    Reviewed-by: Ian Lance Taylor <iant@google.com>
    Reviewed-by: Than McIntosh <thanm@google.com>

 src/syscall/exec_bsd.go           |  6 +++---
 src/syscall/exec_freebsd.go       |  6 +++---
 src/syscall/exec_libc.go          |  6 +++---
 src/syscall/exec_libc2.go         |  6 +++---
 src/syscall/exec_linux.go         |  6 +++---
 src/syscall/exec_unix.go          |  6 +++---
 src/syscall/export_rlimit_test.go |  4 ++--
 src/syscall/rlimit.go             | 12 +++++-------
 src/syscall/syscall_linux.go      |  2 +-
 9 files changed, 26 insertions(+), 28 deletions(-)

I get the following results when benchmarking fa4f951:

goos: linux
goarch: amd64
pkg: context
cpu: AMD Ryzen 9 3950X 16-Core Processor
BenchmarkDeepValueSameGoRoutine
BenchmarkDeepValueSameGoRoutine/depth=10
BenchmarkDeepValueSameGoRoutine/depth=10-16             29495116                40.77 ns/op
BenchmarkDeepValueSameGoRoutine/depth=20
BenchmarkDeepValueSameGoRoutine/depth=20-16             12556218                92.13 ns/op
BenchmarkDeepValueSameGoRoutine/depth=30
BenchmarkDeepValueSameGoRoutine/depth=30-16              8599995               132.6 ns/op
BenchmarkDeepValueSameGoRoutine/depth=50
BenchmarkDeepValueSameGoRoutine/depth=50-16              6290308               206.4 ns/op
BenchmarkDeepValueSameGoRoutine/depth=100
BenchmarkDeepValueSameGoRoutine/depth=100-16             3201490               387.7 ns/op
PASS
ok      context 6.906s

And these are the results of 9bfaaa1:

goos: linux
goarch: amd64
pkg: context
cpu: AMD Ryzen 9 3950X 16-Core Processor
BenchmarkDeepValueSameGoRoutine
BenchmarkDeepValueSameGoRoutine/depth=10
BenchmarkDeepValueSameGoRoutine/depth=10-16             32503959                34.20 ns/op
BenchmarkDeepValueSameGoRoutine/depth=20
BenchmarkDeepValueSameGoRoutine/depth=20-16             17294968                81.50 ns/op
BenchmarkDeepValueSameGoRoutine/depth=30
BenchmarkDeepValueSameGoRoutine/depth=30-16             12121272               100.1 ns/op
BenchmarkDeepValueSameGoRoutine/depth=50
BenchmarkDeepValueSameGoRoutine/depth=50-16              7719627               158.2 ns/op
BenchmarkDeepValueSameGoRoutine/depth=100
BenchmarkDeepValueSameGoRoutine/depth=100-16             3951139               297.7 ns/op
PASS
ok      context 6.818s

dominiquelefevre commented 2 months ago

Go 1.23 (commit 44f1870) performs in this benchmark the same way as go 1.22.0.

callthingsoff commented 2 months ago

% go1.22-9bfaaa15fd version
go version devel go1.22-9bfaaa15fd Fri Sep 29 14:58:44 2023 +0000 darwin/arm64
%                          
% go1.22-fa4f951026 version
go version devel go1.22-fa4f951026 Fri Sep 29 18:51:35 2023 +0000 darwin/arm64
%

% go1.22-9bfaaa15fd test -v -run=BenchmarkDeepValueSameGoRoutine -bench=BenchmarkDeepValueSameGoRoutine -count 10 > 9bfaaa15fd.out
% go1.22-fa4f951026 test -v -run=BenchmarkDeepValueSameGoRoutine -bench=BenchmarkDeepValueSameGoRoutine -count 10 > fa4f951026.out

% ~/go/bin/benchstat 9bfaaa15fd.out fa4f951026.out
goos: darwin
goarch: arm64
pkg: context
                                   │ 9bfaaa15fd.out │           fa4f951026.out           │
                                   │     sec/op     │   sec/op     vs base               │
DeepValueSameGoRoutine/depth=10-8       27.84n ± 1%   27.61n ± 1%       ~ (p=0.171 n=10)
DeepValueSameGoRoutine/depth=20-8       66.41n ± 2%   67.03n ± 1%       ~ (p=0.494 n=10)
DeepValueSameGoRoutine/depth=30-8       94.55n ± 1%   93.91n ± 2%       ~ (p=0.305 n=10)
DeepValueSameGoRoutine/depth=50-8       152.2n ± 0%   150.0n ± 2%  -1.48% (p=0.033 n=10)
DeepValueSameGoRoutine/depth=100-8      293.7n ± 0%   289.0n ± 2%  -1.58% (p=0.000 n=10)
geomean

I tried versions 9bfaaa15fd and fa4f951026 and found that there was no performance degradation, but it got better. I don't know if there is any problem with my operation.

dominiquelefevre commented 2 months ago

@callthingsoff I do not think there is a problem with your operation. You've tested with a different CPU. It suggests that this benchmark got unlucky with icache aliasing after some code was moved around. I would not bother too much with fa4f951.

randall77 commented 2 months ago

I'm pretty sure this is because we're now compiling the type switch in context.value using a jump table instead of a linear search. Which makes me thing the binary search should have indicated CL 521497, not CL 526657. Maybe the latter happened to allow the binary search in some way it wasn't available before. My guess is it handles the multiple entries in a single case statement better.

In any case, the reason it is slower is the benchmark is basically optimal for linear search, as the case selected is always the first one. Using a jump table requires more overhead but should do better when other cases are the common one.

Just to verify I changed the code of context.value to put the *valueCtx case last instead of first, and tip is now a bit faster than 1.21.

So, not entirely clear what should be done here. Perhaps if value contexts are really the common case, we can promote that case out of the switch into its own if statement. We could also raise the threshold for using jump tables. Or we could chalk this up to unfortunate and do nothing.

dominiquelefevre commented 2 months ago

Well, value contexts are really common. I found this benchmark while figuring it out why opentelemetry added so much overhead in a project of mine. It is a very typical pattern to have a chain of contexts like "context with parent span info", "context with my current span", "context with one more stats collector", "context with logger fields", etc. Opentelemetry does that a lot, as well as sentry, azure sdk, golang.org/x/net, etc. Fetching any value from a context that requires going up this chain spends a noticeable amount of time in context.value.

ianlancetaylor commented 2 months ago

Different coding patterns are going to have different frequencies for context types. It's not clear to me that valueCtx is the most common type in general, though it certainly is in BenchmarkDeepValueSameGoRoutine. Different programs are going to have stacks of cancelCtx and/or timerCtx.

That said, @randall77 how did you pick the value 5 for minCases in tryJumpTable? My intuition, without doing any actual testing, is that that seems low, because direct comparisons against 5 constant values is going to be pretty quick. But maybe I don't really understand how gc implements type switches. In this particular example the type switch has 6 cases (and a default case).

randall77 commented 2 months ago

It was basically chosen using cmd/compile/internal/test/switch_test.go:BenchmarkSwitchType*. It looked better certainly at 8, both for the predictable and (even more for) unpredictable cases. It was less clear at smaller sizes, but 5 seemed the right threshold because that's the point at which you start having 3 compare/branch to reach some leaves of the binary search (which you need for every case when n=8). It's certainly possible that a different threshold is warranted. I did not do intensive testing. The only challenge is that it depends on predictability, which we don't know at compile time. So we have to posit some "average" predictability and target that.

dr2chase commented 2 months ago

and perhaps we put a PGO-informed TODO on the switch code? If the profile is true, that might be really helpful.

randall77 commented 2 months ago

@dr2chase Indeed. We'll need basic block profiles #65466 though.

golang / go