golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.08k stars 17.55k forks source link

cmd/compile: Get rid of redundant TEST instructions by using the ZF flag of DEC/INC #49113

Open jake-ciolek opened 2 years ago

jake-ciolek commented 2 years ago

What version of Go are you using (go version)?

go tip

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOVCS=""
GOVERSION="devel go1.18-9ff91b9098 Fri Oct 22 00:57:18 2021 +0000"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/7d/d_44vx7920vfy_tmc_l6xgprjh03ym/T/go-build2772586434=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Dumped the Go assembly itself, but this happens for anything compiled with gc.

What did you expect to see?

The compiler making use of the ZF flag value when emitting DEC(Q|L) / INC(Q|L) instructions.

What did you see instead?

Redundant TEST(Q|L) instructions for checking if the result is zero, for example:

 _math/big.nat.shl:

decq    %rdx
testq   %rdx, %rdx

 _cmd/go/internal/work.(*Builder).Do.func3:

decl    %esi
testl   %esi, %esi

 _go/token.(*FileSet).file:

incl    %edx
testl   %edx, %edx

 _cmd/go/internal/work.gcToolchain.gc:

incq    %r9
testq   %r9, %r9

We could get rid of testq and use the value of ZF set by INCQ/DECQ instead.

I have looked into this briefly and it seems that it might be better to move the INC/DEC implementation to AMD64.rules instead? Currently it's handled in a special way inside of cmd/cmpile/internal/amd64/ssa.go

martisch commented 2 years ago

The non optimization of using flags from arithmetic instructions directly is a general gap (not just INC/DEC).

I think I got a bit around it by defining new instruction for MUL with flag output then optimizing the SETcc CMP away in https://go-review.googlesource.com/c/go/+/141820. If we going to do this more generally we should probably look for a generic way how to easily use the flag output of any instruction in rules files for branching instructions that can consume those flags.

jake-ciolek commented 2 years ago

I think it might go even beyond arithmetic. I've looked at doing this for AND(Q|L)const/XOR(Q|L)const , but ran into issues related to handling blocks...

I've changed the Op definition so it emitted two values (Uint and Flags) but was unable to make it work for some reason. It was somehow failing when handling OpAMD64SETEQ / OpAMD64SETNE and others.

I think we could get a lot of speedup on x86 by doing this.

josharian commented 2 years ago

I'd love to see a general purpose approach to this. Maybe a pass that occurs after flagalloc, to avoid the issues around spilling flags.

josharian commented 2 years ago

I've changed the Op definition so it emitted two values (Uint and Flags) but was unable to make it work for some reason.

Feel free to post a CL showing your changes and the errors if you'd like some help getting unstuck. (Although as mentioned I'd prefer something general purpose.)

ulikunitz commented 2 years ago

Are there micro-benchmarks proving this is a speed increase? Chips are doing macro-op and micro-op fusions nowadays. The test and the following jump are very clearly in that territory. You might end-up with smaller binaries, which will help caching and decoding, but no real speed increase on the µOp level.

jake-ciolek commented 2 years ago

@ulikunitz

I believe reducing code size will improve performance, for example this change where we replace longer forms of AND with a shorter one clearly boosted performance:

https://go-review.googlesource.com/c/go/+/354970

Additionally, even though TEST/JUMP fuses, DEC/JUMP fuses too. Putting a test in between makes A DEC/TEST/JUMP execute one extra instruction.

The same goes for AND, ADD and SUB.

jake-ciolek commented 2 years ago

There's another thing, for which I think I'll start a new issue.

Sometimes we generate code that prevents macro-op fusion from happening because there's an instruction between TEST and a JUMP or a CMP and a JUMP. That instruction could be put before TEST/CMP and the execution of the program wouldn't change, but the speed would increase.

martisch commented 2 years ago

@jake-ciolek RE: removing instructions in between. I think often this can increase performance but I think we should still prove that with microbenchmarks or prove its not worse but better for binary size. There AFAIK can be cases were it doesnt improve performance and not all amd64 do op fusion or deal with partial register flags updates (INC/DEC) equally well without delays. (e.g. if I remember correctly Sandy Bridge has a flag merging uOP delay after some INC/DEC anyway.)