Closed Cylkal closed 4 months ago
Interestingly, this is fixed at tip after https://go-review.googlesource.com/c/go/+/578815
How was your default.pgo generated?
@cuonglm It was collected from my online service by using curl -o pprof.profile http://localhost:7999/debug/pprof/profile
, and the file has been processed. For example, using '--focus clickhouse' on the original sample file outputs only the samples related to ClickHouse. Of course, the original sample file can also trigger this issue.
I tried writing a separate demo program to perform batch insert operations into ClickHouse and found that this issue does not occur. This problem only appears when collecting samples from the online service.
CC @golang/compiler
@cuonglm I think you mean fixed by https://go.dev/cl/581436, no?
@cuonglm I think you mean fixed by https://go.dev/cl/581436, no?
No, because go1.22.3 includes this CL but still trigger ICE.
Unless my operation was wrong, my git bisect points to https://go-review.googlesource.com/c/go/+/578815 which does fix this issue.
@cuonglm thanks for that clarification, I hadn't realized that we backported 581436. Interesting.
Stack trace from go build -gcflags=all=-d=panic=1
:
goroutine 1 [running]:
runtime/debug.Stack()
runtime/debug/stack.go:24 +0x5e
cmd/compile/internal/base.FatalfAt({0x54f3e8?, 0xc0?}, {0xdf1b88, 0x9}, {0xc00054f418, 0x1, 0x1})
cmd/compile/internal/base/print.go:225 +0x1d7
cmd/compile/internal/base.Fatalf(...)
cmd/compile/internal/base/print.go:194
cmd/compile/internal/gc.handlePanic()
cmd/compile/internal/gc/main.go:52 +0x90
panic({0xd722c0?, 0xc000c10f30?})
runtime/panic.go:770 +0x132
cmd/compile/internal/escape.(*batch).initFunc(...)
cmd/compile/internal/escape/escape.go:199
cmd/compile/internal/escape.Batch({0xc000c0e638, 0x1, 0xc000a339e0?}, 0x0?)
cmd/compile/internal/escape/escape.go:139 +0x19de
cmd/compile/internal/ir.(*bottomUpVisitor).visit(0xc000c10480, 0xc000a339e0)
cmd/compile/internal/ir/scc.go:121 +0x2e2
cmd/compile/internal/ir.(*bottomUpVisitor).visit.func1({0xf64760?, 0xc000a339e0?})
cmd/compile/internal/ir/scc.go:78 +0x3b
cmd/compile/internal/ir.(*bottomUpVisitor).visit.func2({0xf64460, 0xc000be2ba0})
cmd/compile/internal/ir/scc.go:92 +0xf7
cmd/compile/internal/ir.Visit.func1({0xf64460, 0xc000be2ba0})
cmd/compile/internal/ir/visit.go:105 +0x2d
cmd/compile/internal/ir.(*CallExpr).doChildren(0xc000bdcfa0, 0xc000c20270)
cmd/compile/internal/ir/node_gen.go:298 +0x9c
cmd/compile/internal/ir.DoChildren(...)
cmd/compile/internal/ir/visit.go:94
cmd/compile/internal/ir.Visit.func1({0xf646a0, 0xc000bdcfa0})
cmd/compile/internal/ir/visit.go:106 +0x51
cmd/compile/internal/ir.doNodes(...)
cmd/compile/internal/ir/node_gen.go:1797
cmd/compile/internal/ir.(*ReturnStmt).doChildren(0xc000bc7cc0, 0xc000c20270)
cmd/compile/internal/ir/node_gen.go:1231 +0xbc
cmd/compile/internal/ir.DoChildren(...)
cmd/compile/internal/ir/visit.go:94
cmd/compile/internal/ir.Visit.func1({0xf649a0, 0xc000bc7cc0})
cmd/compile/internal/ir/visit.go:106 +0x51
cmd/compile/internal/ir.doNodes(...)
cmd/compile/internal/ir/node_gen.go:1797
cmd/compile/internal/ir.(*Func).doChildren(0xf64760?, 0xc000c20270)
cmd/compile/internal/ir/func.go:177 +0x4e
cmd/compile/internal/ir.DoChildren(...)
cmd/compile/internal/ir/visit.go:94
cmd/compile/internal/ir.Visit.func1({0xf64760, 0xc000be1440})
cmd/compile/internal/ir/visit.go:106 +0x51
cmd/compile/internal/ir.Visit({0xf64760, 0xc000be1440}, 0xc000c0f080)
cmd/compile/internal/ir/visit.go:108 +0xb3
cmd/compile/internal/ir.(*bottomUpVisitor).visit(0xc000c10480, 0xc000be1440)
cmd/compile/internal/ir/scc.go:84 +0x1bf
cmd/compile/internal/ir.VisitFuncsBottomUp({0xc000a76488, 0x43, 0x2?}, 0xe2af68)
cmd/compile/internal/ir/scc.go:58 +0xb8
cmd/compile/internal/escape.Funcs(...)
cmd/compile/internal/escape/escape.go:122
cmd/compile/internal/gc.Main(0xe2adb0)
cmd/compile/internal/gc/main.go:256 +0x1076
main.main()
cmd/compile/main.go:57 +0xf9
so this appears to be an escape analysis problem, at least initially.
OK, I spent a little while debugging this.
Escape analysis is bombing in this part of initFunc:
// Initialize resultIndex for result parameters.
for i, f := range fn.Type().Results() {
e.oldLoc(f.Nname.(*ir.Name)).resultIndex = 1 + i
}
because f.Nname is nil as opposed to pointing to the name for the function, which in this case is proto.Nullable[uint64].IsSet
.
What's happening here is that at some point during the compilation of the offending package github.com/ClickHouse/clickhouse-go/v2/ext
we encounter a reference to this specific generic function, and the code that reads the export data marks it for inclusion/compilation within the current package. The key bit of code here is this:
What's supposed to happen at that point is that we'll read in the body for the function, and in fact there is a call to r.addBody
later in funcExt
, e.g. here.
The problem: at the point of that call, r.curfn
is nil, and as a result instead of immediately reading in the body, it gets tacked onto the todoBodies
list here.
Ordinarily this would be fine, but: at the point that this happens we're not doing a "regular" read of the export data, we're looking things up as part of PGO devirtualization. Here's the stack trace:
cmd/compile/internal/noder.(*reader).funcExt /w/go1.22/src/cmd/compile/internal/noder/reader.go:1110
cmd/compile/internal/noder.(*reader).method /w/go1.22/src/cmd/compile/internal/noder/reader.go:1039
cmd/compile/internal/noder.(*pkgReader).objIdxMayFail /w/go1.22/src/cmd/compile/internal/noder/reader.go:818
cmd/compile/internal/noder.(*pkgReader).objIdx /w/go1.22/src/cmd/compile/internal/noder/reader.go:672
cmd/compile/internal/noder.(*pkgReader).objInstIdx /w/go1.22/src/cmd/compile/internal/noder/reader.go:664
cmd/compile/internal/noder.(*reader).obj /w/go1.22/src/cmd/compile/internal/noder/reader.go:636
...
<a couple hundred similar frames elided>
...
cmd/compile/internal/noder.(*reader).doTyp /w/go1.22/src/cmd/compile/internal/noder/reader.go:524
cmd/compile/internal/noder.(*pkgReader).typIdx /w/go1.22/src/cmd/compile/internal/noder/reader.go:431
cmd/compile/internal/noder.(*reader).typWrapped /w/go1.22/src/cmd/compile/internal/noder/reader.go:390
cmd/compile/internal/noder.(*pkgReader).objIdxMayFail /w/go1.22/src/cmd/compile/internal/noder/reader.go:807
cmd/compile/internal/noder.lookupMethod /w/go1.22/src/cmd/compile/internal/noder/unified.go:113
cmd/compile/internal/noder.LookupFunc /w/go1.22/src/cmd/compile/internal/noder/unified.go:65
cmd/compile/internal/pgo.addIndirectEdges /w/go1.22/src/cmd/compile/internal/pgo/irgraph.go:443
cmd/compile/internal/pgo.createIRGraph /w/go1.22/src/cmd/compile/internal/pgo/irgraph.go:291
cmd/compile/internal/pgo.New /w/go1.22/src/cmd/compile/internal/pgo/irgraph.go:192
cmd/compile/internal/gc.Main /w/go1.22/src/cmd/compile/internal/gc/main.go:221
So essentially we wind up with some dangling references on todoBodies
, and we never actually read the body of the function in question.
Not sure yet what the best overall approach is here. I am also unsure as to why this problem would suddenly go away on tip, since as far as can see there haven't been specific changes there that would ameliorate the problem.
More work needed. CC @prattmic for additional eyeballs relating to PGO stuff.
@thanm Thanks for taking a look. noder.LookupFunc
does not currently handle instantiating generic functions, so that part is unsurprising. However, the expectation is that LookupFunc would simply never find the symbol name that appears in the profile at all because that includes the .go.shape.blah
component that doesn't exist in the high-level symbol name.
Since this is a method, we first look up the type, then search for the method. Perhaps one of the other methods on proto.Nullable[uint64]
has already been instantiated, so we successfully find the type after all, but IsSet
itself isn't instantiated? Just a thought, I'm not too familiar with instantiations.
However, the expectation is that LookupFunc would simply never find the symbol name that appears in the profile at all because that includes the .go.shape.blah component that doesn't exist in the high-level symbol name
Yes, this is certainly true. However what's happening here is that the pgoir stuff is looking for (and finding) the method
github.com/ClickHouse/clickhouse-go/v2/lib/column.(*Map).AppendRow
Nothing wrong there, but the body of AppendRow
refers to a struct type that has a data member of generic type (proto.Nullable[uint64]
) which has a method proto.Nullable[uint64].IsSet
) and it is that generic method that is causing the problem.
The detailed sequence of events is as follows:
cmd/compile/internal/pgo.addIndirectEdges runs and pulls in the IR for column.(*Map).AppendRow
since it is a hot indirect target.
the importing process leaves a lot of dangling functions on the todoBodies
list (as described above), including 'proto.Nullable[uint64]'. In addition, the imported adds proto.Nullable[uint64]
to the needWrapperTypes
list.
The inliner runs, but it turns out that the top level package being compiled, github.com/ClickHouse/clickhouse-go/v2/ext
is unusual and it has zero inlining opportunities. If the inliner had run, it would have made a call to readBodies
, but this never happens.
Wrapper generation runs (after inlining, before escape analysis). Since proto.Nullable[uint64]
is on the needWrapperTypes list, we manufacture pointer wrappers for all its methods, including proto.(*Nullable[uint64]).IsSet
. This new wrapper has a direct call to proto.Nullable[uint64].IsSet
(which still doesn't have a func body).
Now escape analysis runs. It processes the two IsSet methods, at which point it crashes.
So we have to have a lot of things lining up in exactly the right way for this to happen.
I am pretty sure that the right fix here is to add a cleanup call to readBodies
from pgo.createIRGraph
, but I am having a heck of a time trying to create a minimized test case that will reproduce the problem.
The thing I keep stumbling over in my smaller test case is that the generic type in question gets added to needWrapperTypes
(which is what I want) but also at some later point gets added to haveWrapperTypes
, which means we don't get the necessary wrapper. Very frustrating. I'll keep hacking away on the minimization.
@thanm I double check and the CL that changes to reflect package actually "fix" this issue.
It's very weird.
Change https://go.dev/cl/591076 mentions this issue: cmd/go: new test case for pgo-devirt compile problem
Change https://go.dev/cl/591075 mentions this issue: cmd/compile/internal: add a PGO devirt post-lookup cleanup hook
Go version
go version go1.22.3 linux/amd64
Output of
go env
in your module/workspace:What did you do?
I have a project that uses ClickHouse, and it compiles successfully. However, when the pprof samples collected online service are placed in the main package of default.pgo, the compilation fails.
I apologize for not being able to provide the source code directly, but here are the pprof sample file
default.pgo
and the codemain.go
that can reproduce the compilation error.What did you see happen?
When the PGO file is removed, i.e., when PGO is disabled, it compiles successfully.
The PGO file comes from my online service, please rename the file profile.pb.gz to default.pgo. profile.pb.gz
What did you expect to see?
Build pass when enable PGO