Closed CannibalVox closed 1 year ago
I just did the obvious experiment, and it is not unified IR. Still slow:
GOEXPERIMENT=nounified go test foo_test.go -bench Bench
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkNoAllocs-8 1000000000 0.3096 ns/op 0 B/op 0 allocs/op
BenchmarkAllocs-8 75144219 13.55 ns/op 8 B/op 1 allocs/op
PASS
Good find.
It looks like a bug in escape analysis.
GOEXPERIMENT=nounified GOSSAFUNC=BenchmarkAllocs go test -c -o foo_19 -gcflags -m=2 foo_test.go
# command-line-arguments [command-line-arguments.test]
./foo_test.go:11:6: can inline (*Object).Initialize with cost 6 as: method(*Object) func() *Object { o.Val = 5; return o }
./foo_test.go:16:6: can inline (*Object).Update with cost 9 as: method(*Object) func() *Object { o.Val = o.Val + 1; return o }
./foo_test.go:21:6: can inline BenchmarkNoAllocs with cost 51 as: func(*testing.B) { (*testing.B).ReportAllocs(b); for loop }
./foo_test.go:22:16: inlining call to testing.(*B).ReportAllocs
./foo_test.go:26:17: inlining call to (*Object).Initialize
./foo_test.go:27:13: inlining call to (*Object).Update
./foo_test.go:32:6: can inline BenchmarkAllocs with cost 52 as: func(*testing.B) { (*testing.B).ReportAllocs(b); for loop }
./foo_test.go:33:16: inlining call to testing.(*B).ReportAllocs
./foo_test.go:37:17: inlining call to (*Object).Initialize
./foo_test.go:37:26: inlining call to (*Object).Update
./foo_test.go:11:7: parameter o leaks to ~r0 with derefs=0:
./foo_test.go:11:7: flow: ~r0 = o:
./foo_test.go:11:7: from return o (return) at ./foo_test.go:13:2
./foo_test.go:11:7: leaking param: o to result ~r0 level=0
./foo_test.go:16:7: parameter o leaks to ~r0 with derefs=0:
./foo_test.go:16:7: flow: ~r0 = o:
./foo_test.go:16:7: from return o (return) at ./foo_test.go:18:2
./foo_test.go:16:7: leaking param: o to result ~r0 level=0
./foo_test.go:21:24: b does not escape
./foo_test.go:36:7: obj escapes to heap:
./foo_test.go:36:7: flow: o = &obj:
./foo_test.go:36:7: from obj (address-of) at ./foo_test.go:37:6
./foo_test.go:36:7: from o := obj (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7: flow: ~R0 = o:
./foo_test.go:36:7: from ~R0 = o (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7: flow: .autotmp_3 = ~R0:
./foo_test.go:36:7: from .autotmp_3 := ~R0 (assign) at ./foo_test.go:37:26
./foo_test.go:32:22: b does not escape
./foo_test.go:36:7: moved to heap: obj
It's possible this was already fixed? A redditor cannot repro in latest 1.20-
go version devel go1.20-fadd77c05b Wed Dec 21 20:18:10 2022 +0000 linux/amd64
BenchmarkNoAllocs-8 1000000000 0.6611 ns/op 0 B/op 0 allocs/op
BenchmarkAllocs-8 1000000000 0.6606 ns/op 0 B/op 0 allocs/op
It occurs to me that when I was asked about "latest release" in the PR template, they might have meant this version and not 1.19.4- my apologies.
I just did the obvious experiment, and it is not unified IR
Quite the opposite. Unified IR actually fixes it. At tip, unified IR doesn't allocate, non-unified allocates. The IR between the two looks slightly different. Non-unified generates an autotmp for the result of the first call, which is the autotmp_3
you saw in the escape analysis diagnostics. The 1.18 AST doesn't have that autotmp (more like the unified IR version). The 1.19 one does.
The escape analysis definitely looks weird. flow: .autotmp_3 = ~R0
assigning to autotmp shouldn't cause it to escape, and it doesn't print any more lines. Not sure why it thinks obj escapes to heap
...
.autotmp_3
is to carry the result of one inlined call to another (visible with -m=5
). I'm inclined to poke at this a bit more to see why it went wrong, it really should not have.
./foo_test.go:36:7:[2] BenchmarkAllocs stmt: var obj Object
./foo_test.go:36:7:[2] BenchmarkAllocs stmt: obj = <nil>
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: .autotmp_3 := ~R0
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: io := obj
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: var io *Object
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: <node inlmark>
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: io.Val = 5
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: ~R0 = io; goto .i1
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: ~R0 = io
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: var ~R0 *Object
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: goto .i1
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: .i1:
./foo_test.go:37:17:.i1: non-looping label
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: uo := .autotmp_3
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: var uo *Object
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: <node inlmark>
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: uo.Val = uo.Val + 1
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0 = uo; goto .i2
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0 = uo
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: var ~R0 *Object
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: goto .i2
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: .i2:
./foo_test.go:37:26:.i2: non-looping label
./foo_test.go:38:10:[2] BenchmarkAllocs stmt: output = obj.Val
./foo_test.go:36:7: obj escapes to heap:
./foo_test.go:36:7: flow: io = &obj:
./foo_test.go:36:7: from obj (address-of) at ./foo_test.go:37:6
./foo_test.go:36:7: from io := obj (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7: flow: ~R0 = io:
./foo_test.go:36:7: from ~R0 = io (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7: flow: .autotmp_3 = ~R0:
./foo_test.go:36:7: from .autotmp_3 := ~R0 (assign) at ./foo_test.go:37:26
./foo_test.go:32:22: b does not escape
./foo_test.go:36:7: moved to heap: obj
This is introduced in https://go-review.googlesource.com/c/go/+/392834, and Unified IR also has this problem until https://go-review.googlesource.com/c/go/+/421821.
~We need to ensure non-Name call in subexpression call is also rewritten.~
Change https://go.dev/cl/459295 mentions this issue: cmd/compile: ensure non-Name call in subexpression call is rewritten
@dr2chase @cherrymui Is performance regression legal for backport?
What do you mean by a "non-Name call"? A call to a func value that is not a named function? In this case it doesn't have a func value. Or the method selection expression is one? That would be usually considered as a named function call, though. I'm not sure I understand how this connects with escape analysis.
I think a significant performance regression would qualify a backport. But I'm not sure this one is significant. It would need to see an impact on real code, instead of synthetic benchmarks.
What do you mean by a "non-Name call"? A call to a func value that is not a named function? In this case it doesn't have a func value. Or the method selection expression is one? That would be usually considered as a named function call, though. I'm not sure I understand how this connects with escape analysis.
"non-Name call" is a function call where the call expression is not a ONAME
node. In (*Object).Initialize(obj).Update()
, after peeling a way the ODOTMETH
Update
, we have (*Object).Initialize(obj)
which is a call. RewriteNonNameCall
see that a call and attempt to assign it to a temp.
I think if the call have no side-effect argument, then we should not apply RewriteNonNameCall
. The purpose of that function is making sure the oder of evaluation in f()(g())
, in case of f()()
, we can skip the rewriting.
I think a significant performance regression would qualify a backport. But I'm not sure this one is significant. It would need to see an impact on real code, instead of synthetic benchmarks.
~I think this affects https://groups.google.com/g/golang-dev/c/8blU7CnuhZ4/m/d21xuN8hAQAJ?utm_medium=email&utm_source=footer~
where the call expression is not a ONAME node
Sorry, I must still have missed something. When would a call expression be an ONAME node? I'd think a call expression would be an OCALL or OCALLXXX node. Or you mean the function being called? (*Object).Initialize(obj)
is not returning a func value, though, but a *Object
.
Back off a level. The AST is not malformed. But the escape analysis behave weirdly, at least it looks like so. I'd think we want to understand why the escape analysis doesn't work as expected. Is it fundamentally not able to handle the current AST? (Then changing the AST would probably be the right answer.) Or it is something else? Thanks.
I think this affects https://groups.google.com/g/golang-dev/c/8blU7CnuhZ4/m/d21xuN8hAQAJ?utm_medium=email&utm_source=footer
That looks like a regression from 1.19 to 1.20, though. We may want to address it before 1.20 release. But no backport involved.
Sorry, I must still have missed something. When would a call expression be an ONAME node? I'd think a call expression would be an OCALL or OCALLXXX node. Or you mean the function being called?
Sorry, I mean the callee expression. In f()()
, the callee expression is f()
, which is a non-Name node.
(Object).Initialize(obj) is not returning a func value, though, but a Object.
Yeah, I am talking at AST level. (Object).Initialize(obj) is the callee expression of the call expression (Object).Initialize(obj).Update()
Back off a level. The AST is not malformed. But the escape analysis behave weirdly, at least it looks like so. I'd think we want to understand why the escape analysis doesn't work as expected. Is it fundamentally not able to handle the current AST? (Then changing the AST would probably be the right answer.) Or it is something else?
I think it's reasonable for escape analysis to behave like that, because we now introduce new temp var for (Object).Initialize(obj) call. Before CL 392834, escape analysis would see (Object).Initialize(obj).Update() call. But now it would see:
tmp := (*Object).Initialize(obj)
tmp.Update()
We have a special case for ONEW inside RewriteNonNameCall, which will help a better escape analysis if we not rewriting at all.
(Object).Initialize(obj) is the callee expression of the call expression (Object).Initialize(obj).Update()
Sorry, I must still have missed something... If by "callee expression" you mean the function being called, in this case it would be (*Object).Initialize(obj).Update
, instead of (*Object).Initialize(obj)
.
I think it's reasonable for escape analysis to behave like that
I don't think so, sorry. In fact, if I rewrite the source code of BenchmarkAllocs
to
tmp := obj.Initialize()
tmp.Update()
it actually does not allocate. Also not allocate with
tmp := (*Object).Initialize(&obj)
tmp.Update()
Sorry, I must still have missed something... If by "callee expression" you mean the function being called, in this case it would be (Object).Initialize(obj).Update, instead of (Object).Initialize(obj).
It is! In case of ODOTMETH
, ODOTINTER
, OMETHVALUE
, we peel away the selector, so we have (*Object).Initialize(obj)
. That being said, for x.g()
, it's a Name call, because after peeling away selector in callee expression, we have x
left. But for x.f().g()
, we have x.f()
left, which is a non-Name call.
I don't think so, sorry. In fact, if I rewrite the source code of BenchmarkAllocs to
Hmm, sounds like there's actual mismatch between emitting autotmp vs hand written one. In case of hand written one, there's a ODCL
node for tmp
at the top of for body. Why in case of emitting autotmp, we haven't:
tmp := Temp((*np).Type())
as := ir.NewAssignStmt(base.Pos, tmp, *np)
as.Def = true
So you're right that fixing the AST (emitting the ODCL for autotmp) will fix the problem.
I still think this is incorrect behavior by escape analysis, but it is taking me a while to figure out exactly where it goes wrong. I've been stepping through the compiler in a debugger since yesterday (as time/focused-attention permit). But thanks for all the investigation, this will tell me where to be especially attentive.
Also, everything gets inlined. This is in code that no longer contains real calls, though there are INLCALLs. Here's the pruned test case:
package foo
var output int
type Object struct {
Val int
}
type B struct {
N int
}
func (io *Object) Initialize() *Object {
io.Val = 5
return io
}
func (uo *Object) Update() *Object {
uo.Val = uo.Val + 1
return uo
}
func BenchmarkAllocs(b *B) {
for i := 0; i < b.N; i++ {
var obj Object
obj.Initialize().Update()
output = obj.Val
}
}
Also, everything gets inlined. This is in code that no longer contains real calls,
Yeah, but all discussion above happens during old typecheck pass, so before any inlining can happen.
The escape analysis runs on inlined code. That's where the decision is made wrong, in my opinion. If this AST is actually malformed, we need some way of rejecting it. Are autotmps required to have ODCL nodes, or would it just solve this problem for us? I just now went looking at uses of typecheck.Temp
and I don't see them getting declared; rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.
Are autotmps required to have ODCL nodes, or would it just solve this problem for us? I just now went looking at uses of typecheck.Temp and I don't see them getting declared; rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.
No, autotmps do not require to ave ODCL node, but emitting ODCL
will help escape analysis record the loop depth correctly.
rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.
What do you mean "50"? Is this the diff in CL 459295 or something else? If you refer to the diff, you may want to look at latest patchset, which is just one line change.
The problem is loopdepth, but I think this is a bug in escape analysis -- anytime there is an autotmp used in a loop, its loopdepth will be wrong unless it is declared, and so far, none of them are declared.
"50" refers to places in the compiler source code where we create temporaries; any of these with a pointer type is conceivably a source of such an escape analysis hiccup. And, also, adding all these declarations for temporaries where none existed before might cause other problems. I think the heuristic to try is that the loop level of an autotmp is the minimum level at which it is assigned; that is where it would have been declared.
And, also, adding all these declarations for temporaries where none existed before might cause other problems.
I'm AFK, but there's a place in compiler where we emit ODCL for autotmp, I cant find it offhand, though.
I think the heuristic to try is that the loop level of an autotmp is the minimum level at which it is assigned; that is where it would have been declared.
Hmm, I think it can apply not just for autotmp, but any local ONAME that appear in the LHS of the assignment. If as.Def
is true, we should record the loopdepth.
ONAMEs can have assignments at any depth. It's important to record the least-nested visibility because that is a form of escape (many-to-one). For autotmps, we do seem to declare them, and we do seem to declare them at varying depth (i.e., I logged this stuff complete with stack traces) and it "seems" to get them right for 1.20. I think for 1.19, this is just a minor performance regression and we live with it. Arguably, as the keeper of many benchmarks, I should add this microbenchmark, since it would be nice not to regress again.
Sorry for the late reply. I agree with @dr2chase that targeting the escape analysis is the right thing to do here. And the heuristic of using the minimum level where an autotmp is assigned is reasonable to me (unless it is otherwise declared).
Sorry for the late reply. I agree with @dr2chase that targeting the escape analysis is the right thing to do here. And the heuristic of using the minimum level where an autotmp is assigned is reasonable to me (unless it is otherwise declared).
Agree, but if we're going to backport, then emitting ODCL seems safer. Otherwise, we can fix escape analysis in 1.21 cycle.
Change https://go.dev/cl/459496 mentions this issue: cmd/compile: record loopdepth for autotmp if declared in assignment
Besides the escape analysis change, I think the escape analysis diagnostics also needs some improvement. If it is the assignment .autotmp_3 = ~R0
causing the escape due to loop depth, it would be good to mention loop depth. Otherwise it is very confusing why this assignment causes problem.
Besides the escape analysis change, I think the escape analysis diagnostics also needs some improvement. If it is the assignment
.autotmp_3 = ~R0
causing the escape due to loop depth, it would be good to mention loop depth. Otherwise it is very confusing why this assignment causes problem.
With -m=3
, loop depth is included in diagnostics:
p.go:26:26:[2] BenchmarkAllocs stmt: .autotmp_3 := ~R0
[2]
-> loop depth is 2
.
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Ran the following benchmarks:
What did you expect to see?
I expected neither benchmark to perform any allocations.
What did you see instead?
BenchmarkAllocs
allocated obj onto the heap, butBenchmarkNoAllocs
does not:However, it does not do this in 1.18.9: