golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.16k stars 17.69k forks source link

cmd/compile: fluent interfaces doing new allocations in go 1.19 #57434

Closed CannibalVox closed 1 year ago

CannibalVox commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.19.4 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Stephen\AppData\Local\go-build
set GOENV=C:\Users\Stephen\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\msys64\mingw64\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:/msys64/mingw64
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.19.4
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\Stephen\projects\vkngwrapper\math\go.mod
set GOWORK=C:\Users\Stephen\projects\vkngwrapper\go.work
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\msys64\tmp\go-build13049274=/tmp/go-build -gno-record-gcc-switches

What did you do?

Ran the following benchmarks:

package math

import "testing"

var output int

type Object struct {
    Val int
}

func (o *Object) Initialize() *Object {
    o.Val = 5
    return o
}

func (o *Object) Update() *Object {
    o.Val = o.Val + 1
    return o
}

func BenchmarkNoAllocs(b *testing.B) {
    b.ReportAllocs()

    for i := 0; i < b.N; i++ {
        var obj Object
        obj.Initialize()
        obj.Update()
        output = obj.Val
    }
}

func BenchmarkAllocs(b *testing.B) {
    b.ReportAllocs()

    for i := 0; i < b.N; i++ {
        var obj Object
        obj.Initialize().Update()
        output = obj.Val
    }
}

What did you expect to see?

I expected neither benchmark to perform any allocations.

What did you see instead?

BenchmarkAllocs allocated obj onto the heap, but BenchmarkNoAllocs does not:

$ go test -test.bench=. -test.run xx .
goos: windows
goarch: amd64
pkg: github.com/vkngwrapper/math
cpu: AMD Ryzen 7 5800X 8-Core Processor
BenchmarkNoAllocs-16                            1000000000               0.2160 ns/op          0 B/op          0 allocs/op
BenchmarkAllocs-16                              143329258                8.381 ns/op           8 B/op          1 allocs/op
PASS

However, it does not do this in 1.18.9:

$ go1.18.9 test -test.bench=. -test.run xx .
goos: windows
goarch: amd64
pkg: github.com/vkngwrapper/math
cpu: AMD Ryzen 7 5800X 8-Core Processor
BenchmarkNoAllocs-16                            1000000000               0.4301 ns/op          0 B/op          0 allocs/op
BenchmarkAllocs-16                              1000000000               0.4271 ns/op          0 B/op          0 allocs/op
PASS
dr2chase commented 1 year ago

I just did the obvious experiment, and it is not unified IR. Still slow:

GOEXPERIMENT=nounified go test foo_test.go -bench Bench
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkNoAllocs-8     1000000000           0.3096 ns/op          0 B/op          0 allocs/op
BenchmarkAllocs-8       75144219            13.55 ns/op        8 B/op          1 allocs/op
PASS
mrjrieke commented 1 year ago

Good find.

dr2chase commented 1 year ago

It looks like a bug in escape analysis.

GOEXPERIMENT=nounified GOSSAFUNC=BenchmarkAllocs go test -c -o foo_19 -gcflags -m=2 foo_test.go
# command-line-arguments [command-line-arguments.test]
./foo_test.go:11:6: can inline (*Object).Initialize with cost 6 as: method(*Object) func() *Object { o.Val = 5; return o }
./foo_test.go:16:6: can inline (*Object).Update with cost 9 as: method(*Object) func() *Object { o.Val = o.Val + 1; return o }
./foo_test.go:21:6: can inline BenchmarkNoAllocs with cost 51 as: func(*testing.B) { (*testing.B).ReportAllocs(b); for loop }
./foo_test.go:22:16: inlining call to testing.(*B).ReportAllocs
./foo_test.go:26:17: inlining call to (*Object).Initialize
./foo_test.go:27:13: inlining call to (*Object).Update
./foo_test.go:32:6: can inline BenchmarkAllocs with cost 52 as: func(*testing.B) { (*testing.B).ReportAllocs(b); for loop }
./foo_test.go:33:16: inlining call to testing.(*B).ReportAllocs
./foo_test.go:37:17: inlining call to (*Object).Initialize
./foo_test.go:37:26: inlining call to (*Object).Update
./foo_test.go:11:7: parameter o leaks to ~r0 with derefs=0:
./foo_test.go:11:7:   flow: ~r0 = o:
./foo_test.go:11:7:     from return o (return) at ./foo_test.go:13:2
./foo_test.go:11:7: leaking param: o to result ~r0 level=0
./foo_test.go:16:7: parameter o leaks to ~r0 with derefs=0:
./foo_test.go:16:7:   flow: ~r0 = o:
./foo_test.go:16:7:     from return o (return) at ./foo_test.go:18:2
./foo_test.go:16:7: leaking param: o to result ~r0 level=0
./foo_test.go:21:24: b does not escape
./foo_test.go:36:7: obj escapes to heap:
./foo_test.go:36:7:   flow: o = &obj:
./foo_test.go:36:7:     from obj (address-of) at ./foo_test.go:37:6
./foo_test.go:36:7:     from o := obj (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: ~R0 = o:
./foo_test.go:36:7:     from ~R0 = o (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: .autotmp_3 = ~R0:
./foo_test.go:36:7:     from .autotmp_3 := ~R0 (assign) at ./foo_test.go:37:26
./foo_test.go:32:22: b does not escape
./foo_test.go:36:7: moved to heap: obj
CannibalVox commented 1 year ago

It's possible this was already fixed? A redditor cannot repro in latest 1.20-

go version devel go1.20-fadd77c05b Wed Dec 21 20:18:10 2022 +0000 linux/amd64

BenchmarkNoAllocs-8  1000000000  0.6611 ns/op  0 B/op  0 allocs/op
BenchmarkAllocs-8    1000000000  0.6606 ns/op  0 B/op  0 allocs/op

It occurs to me that when I was asked about "latest release" in the PR template, they might have meant this version and not 1.19.4- my apologies.

cherrymui commented 1 year ago

I just did the obvious experiment, and it is not unified IR

Quite the opposite. Unified IR actually fixes it. At tip, unified IR doesn't allocate, non-unified allocates. The IR between the two looks slightly different. Non-unified generates an autotmp for the result of the first call, which is the autotmp_3 you saw in the escape analysis diagnostics. The 1.18 AST doesn't have that autotmp (more like the unified IR version). The 1.19 one does.

The escape analysis definitely looks weird. flow: .autotmp_3 = ~R0 assigning to autotmp shouldn't cause it to escape, and it doesn't print any more lines. Not sure why it thinks obj escapes to heap...

dr2chase commented 1 year ago

.autotmp_3 is to carry the result of one inlined call to another (visible with -m=5). I'm inclined to poke at this a bit more to see why it went wrong, it really should not have.

./foo_test.go:36:7:[2] BenchmarkAllocs stmt: var obj Object
./foo_test.go:36:7:[2] BenchmarkAllocs stmt: obj = <nil>
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: .autotmp_3 := ~R0
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: io := obj
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: var io *Object
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: <node inlmark>
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: io.Val = 5
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: ~R0 = io; goto .i1
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: ~R0 = io
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: var ~R0 *Object
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: goto .i1
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: .i1: 
./foo_test.go:37:17:.i1:  non-looping label
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: uo := .autotmp_3
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: var uo *Object
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: <node inlmark>
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: uo.Val = uo.Val + 1
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0 = uo; goto .i2
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0 = uo
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: var ~R0 *Object
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: goto .i2
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: .i2: 
./foo_test.go:37:26:.i2:  non-looping label
./foo_test.go:38:10:[2] BenchmarkAllocs stmt: output = obj.Val
./foo_test.go:36:7: obj escapes to heap:
./foo_test.go:36:7:   flow: io = &obj:
./foo_test.go:36:7:     from obj (address-of) at ./foo_test.go:37:6
./foo_test.go:36:7:     from io := obj (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: ~R0 = io:
./foo_test.go:36:7:     from ~R0 = io (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: .autotmp_3 = ~R0:
./foo_test.go:36:7:     from .autotmp_3 := ~R0 (assign) at ./foo_test.go:37:26
./foo_test.go:32:22: b does not escape
./foo_test.go:36:7: moved to heap: obj
cuonglm commented 1 year ago

This is introduced in https://go-review.googlesource.com/c/go/+/392834, and Unified IR also has this problem until https://go-review.googlesource.com/c/go/+/421821.

~We need to ensure non-Name call in subexpression call is also rewritten.~

gopherbot commented 1 year ago

Change https://go.dev/cl/459295 mentions this issue: cmd/compile: ensure non-Name call in subexpression call is rewritten

cuonglm commented 1 year ago

@dr2chase @cherrymui Is performance regression legal for backport?

cherrymui commented 1 year ago

What do you mean by a "non-Name call"? A call to a func value that is not a named function? In this case it doesn't have a func value. Or the method selection expression is one? That would be usually considered as a named function call, though. I'm not sure I understand how this connects with escape analysis.

cherrymui commented 1 year ago

I think a significant performance regression would qualify a backport. But I'm not sure this one is significant. It would need to see an impact on real code, instead of synthetic benchmarks.

cuonglm commented 1 year ago

What do you mean by a "non-Name call"? A call to a func value that is not a named function? In this case it doesn't have a func value. Or the method selection expression is one? That would be usually considered as a named function call, though. I'm not sure I understand how this connects with escape analysis.

"non-Name call" is a function call where the call expression is not a ONAME node. In (*Object).Initialize(obj).Update(), after peeling a way the ODOTMETH Update, we have (*Object).Initialize(obj) which is a call. RewriteNonNameCall see that a call and attempt to assign it to a temp.

I think if the call have no side-effect argument, then we should not apply RewriteNonNameCall. The purpose of that function is making sure the oder of evaluation in f()(g()), in case of f()(), we can skip the rewriting.

I think a significant performance regression would qualify a backport. But I'm not sure this one is significant. It would need to see an impact on real code, instead of synthetic benchmarks.

~I think this affects https://groups.google.com/g/golang-dev/c/8blU7CnuhZ4/m/d21xuN8hAQAJ?utm_medium=email&utm_source=footer~

cherrymui commented 1 year ago

where the call expression is not a ONAME node

Sorry, I must still have missed something. When would a call expression be an ONAME node? I'd think a call expression would be an OCALL or OCALLXXX node. Or you mean the function being called? (*Object).Initialize(obj) is not returning a func value, though, but a *Object.

Back off a level. The AST is not malformed. But the escape analysis behave weirdly, at least it looks like so. I'd think we want to understand why the escape analysis doesn't work as expected. Is it fundamentally not able to handle the current AST? (Then changing the AST would probably be the right answer.) Or it is something else? Thanks.

I think this affects https://groups.google.com/g/golang-dev/c/8blU7CnuhZ4/m/d21xuN8hAQAJ?utm_medium=email&utm_source=footer

That looks like a regression from 1.19 to 1.20, though. We may want to address it before 1.20 release. But no backport involved.

cuonglm commented 1 year ago

Sorry, I must still have missed something. When would a call expression be an ONAME node? I'd think a call expression would be an OCALL or OCALLXXX node. Or you mean the function being called?

Sorry, I mean the callee expression. In f()(), the callee expression is f(), which is a non-Name node.

(Object).Initialize(obj) is not returning a func value, though, but a Object.

Yeah, I am talking at AST level. (Object).Initialize(obj) is the callee expression of the call expression (Object).Initialize(obj).Update()

Back off a level. The AST is not malformed. But the escape analysis behave weirdly, at least it looks like so. I'd think we want to understand why the escape analysis doesn't work as expected. Is it fundamentally not able to handle the current AST? (Then changing the AST would probably be the right answer.) Or it is something else?

I think it's reasonable for escape analysis to behave like that, because we now introduce new temp var for (Object).Initialize(obj) call. Before CL 392834, escape analysis would see (Object).Initialize(obj).Update() call. But now it would see:

tmp := (*Object).Initialize(obj)
tmp.Update()

We have a special case for ONEW inside RewriteNonNameCall, which will help a better escape analysis if we not rewriting at all.

cherrymui commented 1 year ago

(Object).Initialize(obj) is the callee expression of the call expression (Object).Initialize(obj).Update()

Sorry, I must still have missed something... If by "callee expression" you mean the function being called, in this case it would be (*Object).Initialize(obj).Update, instead of (*Object).Initialize(obj).

I think it's reasonable for escape analysis to behave like that

I don't think so, sorry. In fact, if I rewrite the source code of BenchmarkAllocs to

                tmp := obj.Initialize()
                tmp.Update()

it actually does not allocate. Also not allocate with

                tmp := (*Object).Initialize(&obj)
                tmp.Update()
cuonglm commented 1 year ago

Sorry, I must still have missed something... If by "callee expression" you mean the function being called, in this case it would be (Object).Initialize(obj).Update, instead of (Object).Initialize(obj).

It is! In case of ODOTMETH, ODOTINTER, OMETHVALUE, we peel away the selector, so we have (*Object).Initialize(obj). That being said, for x.g(), it's a Name call, because after peeling away selector in callee expression, we have x left. But for x.f().g(), we have x.f() left, which is a non-Name call.

I don't think so, sorry. In fact, if I rewrite the source code of BenchmarkAllocs to

Hmm, sounds like there's actual mismatch between emitting autotmp vs hand written one. In case of hand written one, there's a ODCL node for tmp at the top of for body. Why in case of emitting autotmp, we haven't:

tmp := Temp((*np).Type())
as := ir.NewAssignStmt(base.Pos, tmp, *np)
as.Def = true

So you're right that fixing the AST (emitting the ODCL for autotmp) will fix the problem.

dr2chase commented 1 year ago

I still think this is incorrect behavior by escape analysis, but it is taking me a while to figure out exactly where it goes wrong. I've been stepping through the compiler in a debugger since yesterday (as time/focused-attention permit). But thanks for all the investigation, this will tell me where to be especially attentive.

Also, everything gets inlined. This is in code that no longer contains real calls, though there are INLCALLs. Here's the pruned test case:

package foo

var output int

type Object struct {
    Val int
}

type B struct {
    N int
}

func (io *Object) Initialize() *Object {
    io.Val = 5
    return io
}

func (uo *Object) Update() *Object {
    uo.Val = uo.Val + 1
    return uo
}

func BenchmarkAllocs(b *B) {
    for i := 0; i < b.N; i++ {
        var obj Object
        obj.Initialize().Update()
        output = obj.Val
    }
}
cuonglm commented 1 year ago

Also, everything gets inlined. This is in code that no longer contains real calls,

Yeah, but all discussion above happens during old typecheck pass, so before any inlining can happen.

dr2chase commented 1 year ago

The escape analysis runs on inlined code. That's where the decision is made wrong, in my opinion. If this AST is actually malformed, we need some way of rejecting it. Are autotmps required to have ODCL nodes, or would it just solve this problem for us? I just now went looking at uses of typecheck.Temp and I don't see them getting declared; rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.

cuonglm commented 1 year ago

Are autotmps required to have ODCL nodes, or would it just solve this problem for us? I just now went looking at uses of typecheck.Temp and I don't see them getting declared; rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.

No, autotmps do not require to ave ODCL node, but emitting ODCL will help escape analysis record the loop depth correctly.

rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.

What do you mean "50"? Is this the diff in CL 459295 or something else? If you refer to the diff, you may want to look at latest patchset, which is just one line change.

dr2chase commented 1 year ago

The problem is loopdepth, but I think this is a bug in escape analysis -- anytime there is an autotmp used in a loop, its loopdepth will be wrong unless it is declared, and so far, none of them are declared.

"50" refers to places in the compiler source code where we create temporaries; any of these with a pointer type is conceivably a source of such an escape analysis hiccup. And, also, adding all these declarations for temporaries where none existed before might cause other problems. I think the heuristic to try is that the loop level of an autotmp is the minimum level at which it is assigned; that is where it would have been declared.

cuonglm commented 1 year ago

And, also, adding all these declarations for temporaries where none existed before might cause other problems.

I'm AFK, but there's a place in compiler where we emit ODCL for autotmp, I cant find it offhand, though.

cuonglm commented 1 year ago

I think the heuristic to try is that the loop level of an autotmp is the minimum level at which it is assigned; that is where it would have been declared.

Hmm, I think it can apply not just for autotmp, but any local ONAME that appear in the LHS of the assignment. If as.Def is true, we should record the loopdepth.

dr2chase commented 1 year ago

ONAMEs can have assignments at any depth. It's important to record the least-nested visibility because that is a form of escape (many-to-one). For autotmps, we do seem to declare them, and we do seem to declare them at varying depth (i.e., I logged this stuff complete with stack traces) and it "seems" to get them right for 1.20. I think for 1.19, this is just a minor performance regression and we live with it. Arguably, as the keeper of many benchmarks, I should add this microbenchmark, since it would be nice not to regress again.

cherrymui commented 1 year ago

Sorry for the late reply. I agree with @dr2chase that targeting the escape analysis is the right thing to do here. And the heuristic of using the minimum level where an autotmp is assigned is reasonable to me (unless it is otherwise declared).

cuonglm commented 1 year ago

Sorry for the late reply. I agree with @dr2chase that targeting the escape analysis is the right thing to do here. And the heuristic of using the minimum level where an autotmp is assigned is reasonable to me (unless it is otherwise declared).

Agree, but if we're going to backport, then emitting ODCL seems safer. Otherwise, we can fix escape analysis in 1.21 cycle.

gopherbot commented 1 year ago

Change https://go.dev/cl/459496 mentions this issue: cmd/compile: record loopdepth for autotmp if declared in assignment

cherrymui commented 1 year ago

Besides the escape analysis change, I think the escape analysis diagnostics also needs some improvement. If it is the assignment .autotmp_3 = ~R0 causing the escape due to loop depth, it would be good to mention loop depth. Otherwise it is very confusing why this assignment causes problem.

cuonglm commented 1 year ago

Besides the escape analysis change, I think the escape analysis diagnostics also needs some improvement. If it is the assignment .autotmp_3 = ~R0 causing the escape due to loop depth, it would be good to mention loop depth. Otherwise it is very confusing why this assignment causes problem.

With -m=3, loop depth is included in diagnostics:

p.go:26:26:[2] BenchmarkAllocs stmt: .autotmp_3 := ~R0

[2] -> loop depth is 2.