josharian opened this issue 8 years ago
I'm going to toss in a few more ideas to consider.
An unexported function that is only called once is often cheaper to inline.
Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.
Functions that only make function calls in error cases, which is fairly common, can be cheaper to handle as a mix of inlining and outlining: you inline the main control flow but leave the error handling in a separate function. This may be particularly worth considering when inlining across packages, as the export data only needs to include the main control flow. (Error cases are detectable as the control flow blocks that return a non-nil value for a single result parameter of type error.)
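As a rough illustration of that split (all names here are hypothetical), a manual version of the transformation might look like this, with the hot path small enough to inline and the error construction outlined:

package p

import "fmt"

// readTag returns the tag byte of a frame. The happy path is tiny and
// inlinable; the cold error path lives in its own function.
func readTag(b []byte) (byte, error) {
	if len(b) > 0 {
		return b[0], nil
	}
	return 0, readTagError(b) // outlined cold path
}

// readTagError is the outlined error case; it is only reached when readTag
// fails, so its size does not count against readTag's inlining budget.
func readTagError(b []byte) (byte, error) {
	return 0, fmt.Errorf("readTag: empty buffer (len %d)", len(b))
}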
One of the most important optimizations for large programs is feedback-directed optimization, aka profile-guided optimization. One of the most important lessons to learn from feedback/profiling is which functions are worth inlining, both on a per-call basis and on a "most calls pass X as argument N" basis. Therefore, while we have no FDO/PGO framework at present, any work on inlining should consider how to incorporate information gleaned from such a framework when it exists.
Pareto optimal is a nice goal but I suspect it is somewhat unrealistic. It's almost always possible to find a horrible decision made by any specific algorithm, but the algorithm can still be better on realistic benchmarks.
Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.
A common case where this would apply is when calling marshallers/unmarshallers that use reflect to inspect interface{} parameters. Many of them tend to have a "fast path" in case reflect.Type is some basic type or has some basic property. Inlining that fast path would make it even faster. E.g., see binary.Read and think how much of its code could be pruned away when the value bound to the interface{} argument is known at compile time.
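A hedged sketch of that shape (readUint32 is hypothetical, much simpler than binary.Read): when the callsite passes a *uint32, inlining plus dead-code elimination could reduce the call to just the fast path.

package p

import (
	"encoding/binary"
	"io"
)

// readUint32 has a fast path for the common concrete type and falls back
// to the reflect-based binary.Read for everything else.
func readUint32(r io.Reader, data interface{}) error {
	if p, ok := data.(*uint32); ok { // fast path, no reflection
		var buf [4]byte
		if _, err := io.ReadFull(r, buf[:]); err != nil {
			return err
		}
		*p = binary.BigEndian.Uint32(buf[:])
		return nil
	}
	return binary.Read(r, binary.BigEndian, data) // slow reflect path
}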
Along the lines of what iant@ said, it's common for C++ compilers to take into account whether a callsite appears in a loop (and thus might be "hotter"). This can help for toolchains that don't support FDO/PGO, or for applications in which FDO/PGO is not being used.
No pragmas that mandate inlining, please.
I already expressed dislike for //go:noinline, and I will firmly object to any proposal for //go:mustinline or something like that, even if it's limited to runtime.
If we can't find a good heuristic for the runtime package, I don't think it will handle real-world cases well.
Also, we need to somehow fix the traceback for inlined non-leaf functions first.
Another idea for the inlining decision is how much simpler the function body could be if inlined. Especially for reflect-using functions that have fast paths: if the input type matches the fast path, then even though the function might be very complicated, the inlined version might be really simple.
Couldn't we obtain a minor improvement in the cost model by measuring the size of generated assembly language? It would require preserving a copy of the tree till after compilation, and doing compilation bottom-up (same way as inlining is scheduled) but that would give you a more accurate measure. There's a moderate chance of being able to determine goodness of constant parameters at the SSA-level, too.
Note that this would require rearranging all of these transformations (inlining, escape analysis, closure conversion, compilation) to run them function/recursive-function-nest at-a-time, so that the results from compiling bottom-most functions all the way to assembly language would be available to inform inlining at the next level up.
doing compilation bottom-up
I have also considered this. There'd be a lot of high-risk work rearranging the rest of the compiler to work this way. It could also hurt our chances to get a big boost out of concurrent compilation; you want to start on the biggest, slowest functions ASAP, but those are the most likely to depend on many other functions.
It doesn't look that high risk to me; it's just another iteration order. SSA also gives us a slightly more tractable place to compute things like "constant parameter values that shrink code size", even if it is only so crude as looking for blocks directly conditional on comparisons with parameter values.
I think we could test the inlining benefits of the bottom-up compilation pretty easily. One way is to do it just for inter-package compilation (as suggested above); another is to hack cmd/compile to dump the function asm size somewhere and then hack cmd/go to compile all packages twice, using the dumped sizes for the second round.
An unexported function that is only called once is often cheaper to inline.
Out of curiosity, why "often"? I can't think, off the top of my head, of a case in which the contrary is true.
Also, just to understand: in -buildmode=exe, is basically every single function besides the entry function going to be considered unexported?
An unexported function that is only called once is often cheaper to inline.
Out of curiosity, why "often"? I can't think, off the top of my head, of a case in which the contrary is true.
It is not true when the code looks like:
func internal() {
// Large complex function with loops.
}
func External() {
if veryRareCase {
internal()
}
}
Because in the normal case, where you don't need to call internal, you don't have to set up a stack frame.
Also, just to understand: in -buildmode=exe, is basically every single function besides the entry function going to be considered unexported?
In package main, yes.
Because in the normal case, where you don't need to call internal, you don't have to set up a stack frame.
Oh I see, makes sense. It would be nice (in other cases too) if setting up the stack frame could be sunk into the if, but it likely wouldn't be worth the extra effort.
Also, just to understand: in -buildmode=exe, is basically every single function besides the entry function going to be considered unexported?
In package main, yes.
The tyranny of unit-at-a-time :D
Functions that start with a run of if param1 == nil { return nil }, where the tests and return values are simple parameters or constants, would avoid the call overhead if just this part was inlined. The size of setting up the call/return could be weighed against the size of the simple tests.
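A sketch of the shape being described (names hypothetical): only the leading guards would need to be inlined at the callsite, with the rest left as an outlined call.

// lookup begins with cheap guard tests; inlining just these two ifs would
// avoid call overhead entirely on nil/empty inputs.
func lookup(m map[string]int, key string) (int, bool) {
	if m == nil {
		return 0, false
	}
	if key == "" {
		return 0, false
	}
	// ... the rest of the function, too large to inline ...
	v, ok := m[key]
	return v, ok
}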
@RalphCorderoy I've been thinking about the same kind of function body "chunking" for early returns. Especially interesting for quick paths, where the slow path is too big to inline.
Unless the compiler chunks, it's up to the developer to split the function in two I presume.
Hi @mvdan. Split the function in two with the intention that the compiler then inlines the non-leaf first one? Another possibility: after stripping some of the simple quick paths, the remaining slow non-inlined function may no longer need some of the parameters.
Yes. For example, here tryFastPath is likely small enough to be inlined (if forward thunk funcs were inlineable, see #8421):
func tryFastPath() {
if someCond {
// fast path
return ...
}
return slowPath()
}
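Concretely, a hypothetical instance of that shape with real types (getCached and getSlow are made up for illustration):

var cache = map[string]string{}

// getCached has the tryFastPath shape: an inlinable guard over the cache,
// falling back to an outlined slow function.
func getCached(key string) (string, bool) {
	if v, ok := cache[key]; ok { // fast path
		return v, true
	}
	return getSlow(key) // forward thunk to the slow path
}

// getSlow is the slow path, too big to inline.
func getSlow(key string) (string, bool) {
	// ... expensive computation and cache fill ...
	return "", false
}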
Too late for 1.9.
Change https://golang.org/cl/57410 mentions this issue: runtime: add TestIntendedInlining
Another example of the current inlining heuristic punishing more readable code:
if !foo {
return
}
if !bar {
return
}
if !baz {
return
}
//do stuff
is more "expensive" than
if foo && bar && baz {
// do stuff
}
return
This is based on real code from the regexp package (see https://go-review.googlesource.com/c/go/+/65491 for details).
Another issue with the inliner is that a function with an inlined call to some other function has a higher cost than the manually inlined equivalent. Consider the following example:
package main
var a int
var b int
func foo() { a = b }
func f00() { foo() }
func f01() { f00() }
func f02() { f01() }
func f03() { f02() }
func f04() { f03() }
func f05() { f04() }
func f06() { f05() }
func f07() { f06() }
func f08() { f07() }
func f09() { f08() }
func f10() { f09() }
func f11() { f10() }
func f12() { f11() }
func f13() { f12() }
func f14() { f13() }
func f15() { f14() }
func f16() { f15() }
func f17() { f16() }
func f18() { f17() }
func f19() { f18() }
func f20() { f19() }
func f21() { f20() }
func f22() { f21() }
func f23() { f22() }
func f24() { f23() }
func f25() { f24() }
func f26() { f25() }
func f27() { f26() }
func f28() { f27() }
func f29() { f28() }
func f30() { f29() }
func f31() { f30() }
func f32() { f31() }
func f33() { f32() }
func f34() { f33() }
func f35() { f34() }
func f36() { f35() }
func f37() { f36() }
func f38() { f37() }
func f39() { f38() }
func main() {
f39()
}
It should be optimized into:
package main
var a int
var b int
func main() {
a = b
}
However, after each inlining the cost of the new function is increased by 2, causing (with -m -m):
./foo.go:47:6: cannot inline f38: function too complex: cost 81 exceeds budget 80
This is caused by the inliner adding a new goto to/from the inlined body and new assignments for each argument/return value. For the simple case (single return, no assignments to arguments) we should probably adjust the cost of inlining outer functions by something like 2 + number_of_arguments + number_of_return_values. I've tried simply raising the limit by 20, and e.g. image/png's paeth gets significantly faster:
Paeth-6 6.83ns ± 0% 4.39ns ± 0% -35.73% (p=0.000 n=9+9)
Because now paeth, with abs inlined, is itself inlinable.
Another inlining heuristic idea: if a small function is called with a func literal argument, inline the small function, and "inline" (or is it a constant fold?) the call to the now-constant func argument. That would enable zero-cost control flow abstraction. This may depend on how many times the argument is called: if it's just once then there's no reason not to, but if it's more than once then it may depend on the complexity of the argument.
For example, in CL 109716 it would have been nice to abstract out the high-performance bitmap iteration pattern, but there's no way to do that right now without adding a fair amount of overhead. With this heuristic, it would have been possible to lift that pattern into a function at no cost, where the loop body was supplied as a func literal argument.
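A hedged sketch of that pattern (forEachBit and countOdd are hypothetical, inspired by but not taken from CL 109716): with the proposed heuristic, the closure call inside forEachBit could be folded away, compiling countOdd as if the loop body were written inline.

package p

import "math/bits"

// forEachBit invokes f with the index of every set bit in word,
// lowest to highest.
func forEachBit(word uint64, f func(i int)) {
	for word != 0 {
		i := bits.TrailingZeros64(word)
		f(i)
		word &^= 1 << uint(i) // clear the bit we just visited
	}
}

// countOdd abstracts its loop body as a func literal; today this costs a
// closure call per set bit, but inlining forEachBit and then the
// now-constant f would eliminate that overhead.
func countOdd(word uint64) (n int) {
	forEachBit(word, func(i int) {
		if i%2 == 1 {
			n++
		}
	})
	return n
}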
Change https://golang.org/cl/125516 mentions this issue: cmd/compile: set stricter inlining threshold in large functions
In the commit message of CL 125516, @randall77 commented:
At some point it might be nice to have a better heuristic for "inlined body is smaller than the call", a non-cliff way to scale down the cost as the function gets bigger, doing cheaper inlined calls first, etc.
(Copied here so that it is easier to find by folks interested in improving inlining heuristics.)
In a Gophercon talk right now, the presenter demoed removing the use of math/bits intrinsics to reclaim the inlining budget consumed by those "function calls". Unfortunate.
We give calls to intrinsics a cost of 1. They don't get a full call's cost. What am I missing?
Maybe nothing. The call was to math/bits.RotateLeft32. I don't see that in the list of intrinsics in cmd/compile/internal/gc/ssa.go. (But maybe there is someplace else to look that I don't know about.)
Ah yes, that's not an intrinsic. We detect the rotate directly from the operations in the body of that function.
The body looks like it has ~7 operations in it, but it reduces to just one instruction in the end.
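For reference, the body in question looks roughly like this (a sketch of the standard rotate idiom, close to but not necessarily identical to the math/bits source):

func rotateLeft32(x uint32, k int) uint32 {
	const n = 32
	s := uint(k) & (n - 1)
	// The compiler pattern-matches this shift pair and lowers it to a
	// single rotate instruction on amd64.
	return x<<s | x>>(n-s)
}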
Should we start pretending we had feedback information that we had stored in a file, and use that to guide inlining?
hotsites=["blake_foo.go:17(math/bits.RotateLeft32)","blake_f..go:19(math/bits.RotateLeft32)"] hotfuncs=["whatever.blake/g"]
Make that something the compilation depends on: if we change the file, we force a recompilation.
Ah yes, that's not an intrinsic.
Short term, perhaps we should just make it an intrinsic (for AMD64). Then it behaves appropriately end-to-end, including inlining. That's pretty easy and safe. Maybe I can help someone get this in tomorrow at community day.
There are many long term fixes, of course.
Here's a test case we can add:
// +build amd64
// errorcheck -0 -m
// Copyright 2018 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Test that inlining of math/bits.RotateLeft* treats those calls as intrinsics.
package p
import "math/bits"
var (
x8 uint8
x16 uint16
x32 uint32
x64 uint64
x uint
)
func f() { // ERROR "can inline f"
x8 = bits.RotateLeft8(x8, 1) // ERROR "inlining call"
x16 = bits.RotateLeft16(x16, 1) // ERROR "inlining call"
x32 = bits.RotateLeft32(x32, 1) // ERROR "inlining call"
x64 = bits.RotateLeft64(x64, 1) // ERROR "inlining call"
x = bits.RotateLeft(x, 1) // ERROR "inlining call"
}
This fails with 1.11 and should pass.
Unless I hear otherwise, @andybons and I will make this an intrinsic tomorrow.
Change https://golang.org/cl/132435 mentions this issue: cmd/compile: make math/bits.RotateLeft* an intrinsic on amd64
Any chance this gets in for go 1.12?
Someone needs to work on it. And have an idea of how to do it. Other than that, it's ready to go :)
@randall77 ;) I wish I knew enough about it to jump in - I would be all over it. This ... just ... feels like low-hanging fruit towards closing the performance gap against C/Rust/etc, and it seems to have been on the radar for a few years now.
Change https://golang.org/cl/141117 mentions this issue: cmd/compile: remove some inl budget hacks
Change https://golang.org/cl/163760 mentions this issue: cmd/compile: make math/bits.RotateLeft{32,64} intrinsics on ppc64x
I just had to write this crime against safety to appease the inliner:
// lightReduce brings the limbs below 52, 51, 51, 51, 51 bits. It is split in
// two because the inliner works actively against us. The two functions MUST be
// called one after the other.
func (v *FieldElement) lightReduce1() *FieldElement {
v[1] += v[0] >> 51
v[0] = v[0] & maskLow51Bits
v[2] += v[1] >> 51
v[1] = v[1] & maskLow51Bits
v[3] += v[2] >> 51
return v
}
func (v *FieldElement) lightReduce2() *FieldElement {
v[2] = v[2] & maskLow51Bits
v[4] += v[3] >> 51
v[3] = v[3] & maskLow51Bits
v[0] += (v[4] >> 51) * 19
v[4] = v[4] & maskLow51Bits
return v
}
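For comparison, the single function the split replaces is presumably just the two halves concatenated (a sketch, reusing FieldElement and maskLow51Bits from above), which the inliner rejects as over budget:

// lightReduce is the unsplit version; as one function it exceeds the
// inlining budget, even though each half fits.
func (v *FieldElement) lightReduce() *FieldElement {
	v[1] += v[0] >> 51
	v[0] = v[0] & maskLow51Bits
	v[2] += v[1] >> 51
	v[1] = v[1] & maskLow51Bits
	v[3] += v[2] >> 51
	v[2] = v[2] & maskLow51Bits
	v[4] += v[3] >> 51
	v[3] = v[3] & maskLow51Bits
	v[0] += (v[4] >> 51) * 19
	v[4] = v[4] & maskLow51Bits
	return v
}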
It's not just me. I have now heard systematically from anyone who writes high performance Go cryptography that a big reason they have to resort to assembly for big functions, or to making their code unreadable, is that they can't control the inliner.
I've been telling them that in Go we want to find good defaults that work without annotations, but I am less and less convinced by that argument myself. If all we are doing by rejecting //go:inline is causing the parent function to be marked //go:noescape and rewritten in assembly, or forcing them to do manual inlining (that is, copy-paste without docs or a meaningful function name), I don't think we are doing anyone a service.
More fundamentally, as the inliner gets smarter, it will also get less predictable. For example, if you reuse a function which was inlined because it was only used once (from @ianlancetaylor's https://github.com/golang/go/issues/17566#issuecomment-255793777), you'll make code you didn't touch dramatically slower. That also feels very un-Go-like. You wouldn't expect to have to run the benchmarks for everything every time you barely touch anything, and you wouldn't want to have to train everyone who works on the same codebase to do so. But then the only way to be sure performance will be reliably acceptable going forward is to write assembly :(
An FDO/PGO framework might fix that, but short of that, I think there's an unavoidable tension between avoiding annotations and avoiding subtle behavior, and I think it might be time to re-evaluate the trade-off.
I think part of the trouble here is that inlining is performed before any SSA passes. This is (IMO) too early in compilation to make intelligent inlining decisions. It stands to reason that the inlining code wouldn't have to be nearly as 'clever' if it actually knew how many instructions were in each function, rather than guessing by looking at the AST.
@philhofer at least across packages, https://github.com/golang/go/issues/25999 might help with that.
@FiloSottile this is obviously not the point, but does it help if you write e.g. v[0] &= instead of v[0] = v[0] &?
It's not just me. I have now heard systematically from anyone who writes high performance Go cryptography that a big reason they have to resort to assembly for big functions, or to making their code unreadable, is that they can't control the inliner.
The same happens when writing other kinds of performance code, like in my emulators. I've been looking at Go generated code for many years now, and I've been trying to use Go to generate highly optimized code. I believe that the current awful inlining capability is the single most offending point in the compiler with respect to generated code quality. I think what's worse is that it looks like nobody is working on it right now, as far as I can tell.
Given the current situation and the fact that it's unlikely to change soon, I also vote for adding an inlining hint.
@FiloSottile, what cost does inlining reduce in that particular case?
@FiloSottile this is obviously not the point, but does it help if you write e.g. v[0] &= instead of v[0] = v[0] &?
@josharian Not enough. I could start playing games where I replace slice lookups with local variables, but the readability cost and unreliability of the result are kind of my point. A call to runtime.Inline() would be clearer than // Don't refactor! It will make ScalarMult 5% slower!
@FiloSottile, what cost does inlining reduce in that particular case?
@aclements Function call overhead in functions that are themselves used in very hot loops. I actually wish I could benchmark inlining Mul and Add themselves. I suspect I would see a 5-10% speedup in higher-level benchmarks.
name old time/op new time/op delta
Add-8 6.72ns ± 2% 7.87ns ± 2% +17.02% (p=0.000 n=10+8)
Mul-8 29.7ns ± 1% 30.8ns ± 3% +3.93% (p=0.000 n=9+10)
Ideally the compiler's inlining heuristics would be smarter (e.g., basing decisions on SSA cost instead of AST), but this issue has been open for 2 years, and I don't see us making significant progress on moving inlining to SSA form in the near future.
If users are writing contorted code to work around the compiler's inlining heuristics, I think it makes sense to just provide a //go:inline hint directive to ignore those heuristics. It seems unlikely to me that our heuristics will ever be perfect anyway, and as long as we need to support //go:noinline, the incremental complexity to support //go:inline seems pretty minimal.
@FiloSottile Do you want to file a cmd/compile proposal for //go:inline and we can discuss this more there, and keep this issue focused on improving the inlining heuristics?
The question I have for //go:inline is: does it apply to functions or to callsites? It sounds like the proposal is the former, but I worry that people would really want the latter.
Callsite pragma sounds interesting to me. Maybe go a bit farther and say that such pragmas are required to be intra-package (caller and callee both in package X, and callee is not exported)? Even then a "must inline" pragma is still a scary prospect. Well-intentioned users could easily blow up package compile times by accident. Also seems like the inliner would have to build some sort of call graph in order to detect cycles in must-inline directives.
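Purely to make the two options concrete (hypothetical syntax, not a proposal): a function-level directive would annotate the declaration, while a callsite directive would annotate an individual call.

// Function-level (hypothetical): applies to every call of hot.
//go:inline
func hot(x int) int { return x*2 + 1 }

func caller(xs []int) (sum int) {
	for i := range xs {
		// Callsite-level (hypothetical): applies only to the next call.
		//go:inline
		sum += hot(xs[i])
	}
	return sum
}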
Do you want to file a cmd/compile proposal for //go:inline and we can discuss this more there
This was proposed at https://github.com/golang/go/issues/21536. I will re-open that for further discussion.
@thanm:
Also seems like the inliner would have to build some sort of call graph in order to detect cycles in must-inline directives.
FWIW, cmd/compile already detects cycles to avoid inlining recursive calls.
@josharian:
This was proposed at #21536. I will re-open that for further discussion.
Thanks!
Maybe go a bit farther and say that such pragmas are required to be intra-package (caller and callee both in package X, and callee is not exported)?
For protobufs, that would be a bit too restrictive as we have internal packages with functionality we want inlined elsewhere. Either no-restriction would be preferred or restricted to being within a module.
@aclements Function call overhead in functions used themselves in very hot loops.
Is there a benchmark I could look at? I'm actually a bit surprised, since in general the inlining threshold is high enough that things above that already have low function call overhead (there are of course places where this doesn't apply, like if a lot of AST turns into very few instructions, or if there are many paths through a function but all are short). Usually the main benefit of inlining is to enable further function-level optimizations, but I don't think that's the case for your specific code.
One thing the inliner sadly lacks right now is any form of call-site heuristics. A call in a loop could, for example, have a higher inlining threshold, since the loop suggests that it will be executed more often.
Perhaps the real answer in this case is that function call overhead needs to be lower. :) (#18597 and #31193, which I just filed).
if a lot of AST turns into very few instructions
This happens a lot. The inverse does as well (a few AST nodes turn into a lot of instructions), with append being the worst culprit.
The current heuristic really is awful.
if there are many paths through a function but all are short
Example: https://github.com/golang/go/issues/30548
Usually the main benefit of inlining is to enable further function-level optimizations, but I don't think that's the case for your specific code.
Nor has it been in the cases where I have spent significant effort fighting the inliner. The benefit is generally avoiding function call overhead (register spills, jump, prologue, stack zeroing), and also removing a giant optimizer roadblock (a bunch of opaque memory operations).
One thing the inliner sadly lacks right now is any form of call-site heuristics.
FDO/PGO. :)
A call in a loop could, for example, have a higher inlining threshold, since the loop suggests that it will be executed more often.
Also, calls with many params/results should have a higher inlining threshold, since just setting up the stack frame and reading the results takes more code.
I think what's worse is that it looks like nobody is working on it right now, as far as I can tell.
FWIW, I look at this on and off. I have experimented with manual tweaks and with simple machine learning. (I'd like to try complicated ML, but I know that it would never be merged.) I have experimented with the cross-package post-SSA export-data approach hinted at above; it is promising but intricate to implement.
The biggest problem is, as I said in the very first post that opened this issue, that inlining has significant effects on binary size, compilation time, and compiled code performance, which means that most changes involve complicated trade-offs. It requires a lot of work to accurately assess whether a particular change is an improvement.
The second biggest problem (at least for me) is that it is a bit too large a project to bite off as a hack-on-the-side-for-fun contributor. :)
The current inlining cost model is simplistic. Every gc.Node in a function has a cost of one. However, the actual impact of each node varies. Some nodes (OKEY) are placeholders that never generate any code. Some nodes (OAPPEND) generate lots of code.
In addition to leading to bad inlining decisions, this design means that any refactoring that changes the AST structure can have unexpected and significant impact on compiled code. See CL 31674 for an example.
Inlining occurs near the beginning of compilation, which makes good predictions hard. For example, new or make or & might allocate (large runtime call, much code generated) or not (near zero code generated). As another example, code guarded by if false still gets counted. As another example, we don't know whether bounds checks (which generate lots of code) will be eliminated or not.
One approach is to hand-write a better cost model: append is very expensive, things that might end up in a runtime call are moderately expensive, pure structure and variable/const declarations are cheap or free.
Another approach is to compile lots of code and generate a simple machine-built model (e.g. linear regression) from it.
I have tried both of these approaches, and believe both of them to be improvements, but did not mail either of them, for two reasons:
Three other related ideas:
cc @dr2chase @randall77 @ianlancetaylor @mdempsky