cmd/compile: enhancements to the gc inliner and optimizer

tildeleb commented 9 years ago

Background I recently tried optimizing some hash functions in Go to find out what is possible without resorting to using assembler. I knew gc had an "inliner" but I didn't have any experience with it. I spent some time reading the "inl.c" code.

Summary Go's inliner seems to work well and I got speed increase of 30% on some hash functions using it. However, the way it is currently configured and controlled effectively turns if off for most cases. Most users won't figure out how to set the level (-gcflags="-l -l -l -l -l -l -m") and even less will be willing to recompile the tool chain to change the hard coded hairiness variable set to more than 40. The inliner would be more effective if it could handle closures and variadic functions. Finally, a simple idiom recognizer would allow a simple function such as rot() to be turned into a single instruction. Taken together these changes could yield significant performance improvements to certain kinds of Go packages.

Here are my impressions, suggestions, and requests:

A simple code example is here, it's just one one Jenkin's Hash functions: https://play.golang.org/p/rmDG7MR5F9 This comes from: https://github.com/tildeleb/hashland/blob/master/jenkins/jenkins.go The main project is: https://github.com/tildeleb/hashland/

Observations and Impressions

The inliner works pretty well but you need to add the following flags to build to get it work: go build -gcflags="-l -l -l -l -l -l -m"
.The inliner has budget for the complexity of functions it is willing to inline. The variable is called hairiness. It is current set to 40 and that's not very hairy. In the example above. I had to break mix() and final() into 3 functions of two lines each, to get them to inline. 40 really seems like a very small budget to me. Hard coding a parameter like this is just wrong.
Currently closures can't be inlined. Using the example above, that means the closures mix() and final() which close over the hash state variables a, b, and c need to moved out of the function "HashWords". Since the state variables a, b, and c are lo longer closed over they need to be passed to and returned from the mix() and final() functions. For example: a, b, c = mix(a, b, c). If all hash functions had 3 state variables that might be OK.

However, other more advanced hash functions have more state. For example in Jenkin's SpookyV2 I had to write: h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11 = Mix(inul, h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11)

Which I hope we can all agree is out of control and is why I believe why closures are the preferred solution to many inlining based use cases.

The downside of the closures is that if you have multiple versions of a hash function that use the same mix() and final() code, those closures need to be duplicated inside the top level functions.

Again, I prefer the closures. The source code is cleaner, easier to read, and probably easier to optimize because the optimizer doesn't need to optimize out the calling and return stack push/pop sequences to get good code quality.

Go needs an idiom recognizer. Consider the example again. The function rot(x, k) gets inlined but it doesn't generate a rot insturction. Other compilers are able to do this. Actually, I think a case can be made to add a rotate (no carry) operator (>>> or <<<) to the language for this particular case, but that's another "issue" as they say. BTW Go generates about 22 instructions to get the rotate done.
Variadic functions aren't inlined. I am considering proposing the following as a new interface for hash functions that would be complementary to the current hash streaming interface.

Hash32(b []byte, seeds ...uint32) uint32
Hash64(b []byte, seeds ...uint64) uint64
Hash128(b []byte, seeds ...uint64) (uint64, uint64)
Hash(in []byte, out []byte, seeds ...uint64) []byte

Because of this I would like to see inlining of variadic functions.

FWIW I still have to benchmark this vs other alternative calling sequences.

Suggestions

Increase the default "hairiness" from 40. 40 really isn't very much. Set the default "level" to a value higher that 0 or 1.
Optimizer command line flags. In the short term add temporary native command line flags to set hairiness and level. I know the Go team eschews command line flags but some control over the optimization process has been around since the unix epoch and is probably inevitable.

For example: % go build -ilbudget=120 illevel=6

Extend the inliner to inline closures and variadic functions.
Over the medium term add a simple idiom recognizer so common idioms such as rot in the example above can be turned into single instructions, really (2-3 instructions counting the MOVQ to the registers instead of the 20+ now generated to get a rot.
Consider adding a rot operator to the language.

minux commented 9 years ago

gc does recognize constant rotates. It's not done by the inliner, it's recognized in the walkrotate() function in cmd/gc/walk.c For unconstrained rotates, it does need to do a lot additional checks. Blindly using the rotate instruction might not what the user wants.

Fortunately, most modern crypto/hashing algorithms use constant rotates, so i don't think recognizing variable rotate matters that much.

I agree that inlining of function closure that doesn't escape will be very beneficial. I certainly used it a lot.

tildeleb commented 9 years ago

@minux Thanks for the info about the rotates, much appreciated. I think the comment in walk.c says what I was trying to say. // TODO: Could allow s and 32-s if s is bounded (maybe s&31 and 32-s&31).

Still I think it bears repeating that I saw it a comment somewhere that GCC recognizes this func rot(x, k uint32) uint32 { return x << k | x >> (32 - k) } as the idiom, without the &31. I haven't verified that yet, but I will at some point.

I guess I can get what I want today for the ROT instruction if I am willing to flush the function call to rot and expand the shifts by hand, but at some loss of clarity as to what is going on in the algorithm.

So a ^= rot(c, 4) becomes a ^= c<<4 | c>>28

Could the compiler do the &31 or &63 for the shift/rotate amounts for you?

minux commented 9 years ago

I expect the deficiencies in the current gc compiler will go away when they are rewritten into Go and another target-independent middle end is introduced.

tildeleb commented 9 years ago

One of my points was a lot could be gained from the current compiler by increasing hairiness and level and adding command line flags to set them. That is not a large request. We are talking about less than 10 lines of code.

I would like to see this added to 1.5. It makes sense (to me) to do it at the same time as the rewrite.

ianlancetaylor commented 9 years ago

We want the rewrite to be purely mechanical. Let's revisit this after the rewrite is complete.

minux commented 9 years ago

FYI, you can use -l=5 instead of repeating -l 5 times.

tildeleb commented 9 years ago

cc @randall77

tildeleb commented 9 years ago

How much do I have to pay to get just a switch set hairiness? I'd settle for that in 1.5. It would affect the rewrite at all not any automated testing. You can leave hairiness == 40.

In return I'd have a much easier time getting speedups in Go for some hash functions.

My kingdom for a switch.

randall77 commented 9 years ago

Command-line flags are tricky. The problem is "go get". How do you tell it what flags you want? What if two packages you depend on want different flags? Incompatible flags? The ecosystem is much nicer if there is just one compiler for everything.

The current compiler does have some flags (like -l), but they are for debugging the compiler only, not for any production use.

We can certainly revisit the built-in settings for hariness and level.

tildeleb commented 9 years ago

As I said above I understand the desire to eschew all flags. On the other hand optimizers often have bugs and I have found the ability to have some control over them vital to my productivity over the years. e.g. I would have been blocked but I could proceed with the optimizer turned off.

You point about go get and packages it well taken and true.

My opinion is you will have a very hard time finding universal constants for hariness and level and people like me will be frustrated with your choice because you end up saying, "geez if hariness was just 1o more this would inline" and ditto for level. Or there is some onscure case that's broken for level n and the expediant fix is to decrement level by one until some other issue is fixed.

Here is a proposal.

Could source files contain a comment ala build tags that describes what kind of optimizations and other flags are needed? The benefits are probably obvious but I'll state them to be complete.

The action is strictly local to a single go file
Any compiler implementation is free to ignore them because they are comments

"Comment control" was just added for "go generate" so there is a precendent.

minux commented 9 years ago

No, thanks. Let's not introduce more magic comments. The compiler should do the right thing by default.

//go:generate is not for a precedent for this, because the actual Go compiler doesn't recognize them. It's the Go command that processes them.

tildeleb commented 9 years ago

I understand the sentiment expressed above. My supposition is that the proliferation of comment based control mechanisms is a symptom of something that is missing, namely a way to associate metadata with files, packages, and programs. I would make a proposal but I suspect it would be rejected. There seems to be some denial going on here.
If I had to add one flag to the compiler it would be "-noopt" or turn off all possible optimizations. For whatever reason I've found many compiler/optimizer bugs in my life.. the ability to turn off all optimization has saved my ass more than any other flag.
The current compiler has a ways to go before anyone can claim that it is doing the right thing. Just the other night I tried for create a very low overhead generic dispatch mechanism. You can't easily do it in Go. The core of any fast dispatch mechanism is usually a function with a switch statement that uses contiguous integers to create an indirect dispatch table that can be inlined. I don't even know if Go does the latter, but switch doesn't inline so even if it does, you can't get low overhead. Sigh
As far as "revisit the built-in settings for hariness and level" I would say level needs to be 6 and hairiness has to be over 1,000. Is anyone willing to do that?
Enhancing the inliner to inline closures and switch statements would also be very helpful. I honestly believe there could be a lot of bang for the buck in doing all this as far as performance goes. Another way to say it is there is low hanging fruit here.

davecheney commented 9 years ago

-N should disable compiler optimisations.

bradfitz commented 9 years ago

No need for the tone.
Yes. We even run a linux-amd64-noopt builder at http://build.golang.org/ to make sure the noopt mode continues to work. We've had it break a couple times in the past when nobody is paying attention. Now the builder forces us to pay attention.
You talk about doing the "right thing" (which I read as correctness), and then you go on to talk about optimizations, not correctness bugs. But it's a tangent. Try not to lump too many issues together. If you need to pulpit to complain about Go, at least do it over separate bug reports.
What is the justification for 6 and 1000? Nobody is going to tweak a number without some analysis and rationale.
Yes. That would probably be a separate bug. I already don't know what this one is about anymore.

minux commented 9 years ago

Turning off optimization is easy, -gcflags "-N -l" will do.

Extreme inlining will bloat the binary, we need to do more investigation to weigh the trade-offs. (And I have to say Go binaries are already pretty large. I'm willing to go so far as to do some compression to reduce the Go binary size. We definitely shouldn't just blindly increase the level and hairiness)

Also, just because a function could be inlined, that doesn't mean we should inline it. GCC has fairly complex heuristics regarding when to inline and when not to inline.

PS: any non-leaf function will likely remain to be not inlined, otherwise that will change the stack trace, and we value debuggability better than raw performance there.

tildeleb commented 9 years ago

Sorry about the last two lines of #1164

I'll summarize the issues and requests below and sign off on this one. I apologize if this issue wandered around a bit. Everything here is related to the inliner and optimization.

" The compiler should do the right thing by default" didn't come from me. I think over the long term that's the correct approach. In the short and medium term other apporaches should be considered.

Must upsetting to me is the perception that I am complaining. I think frustrated would be better word. With a little effort the Go inliner could be much more effective. It's low hanging fruit.

My bad. I'll restate that.

dvyukov commented 9 years ago

PS: any non-leaf function will likely remain to be not inlined, otherwise that will change the stack trace, and we value debuggability better than raw performance there.

That's work for debug info. A single PC can be associated with a whole stack of inlined functions, and DWARF supports this. That will improve current situation as well, because currently we miss frames in CPU profiles and during nil derefs in inlined functions.

minux commented 9 years ago

On Sun, Jan 11, 2015 at 3:50 AM, Dmitry Vyukov notifications@github.com wrote:

PS: any non-leaf function will likely remain to be not inlined, otherwise that will change the stack trace, and we value debuggability better than raw performance there.

That's work for debug info. A single PC can be associated with a whole stack of inlined functions, and DWARF supports this. That will improve current situation as well, because currently we miss frames in CPU profiles and during nil derefs in inlined functions.

DWARF supports everything. But we certainly don't want to base our backtrace support on DWARF.

If we support showing backtraces for inlined functions, ideally I still want to see the argument values dump like we do now, but on the other hand, that's close to impossible to implement without invalidating all the benefit of inlining.

dvyukov commented 9 years ago

It's easy to do it our own traceback format. Yes, arguments of inlined functions will be missing.

tildeleb commented 9 years ago

Sorry I was gone for awhile. Thanks to @rsc for keeping this one open.

A few comments:

While it's an admirable goal to support complete backtraces for inlined functions I don't know if that needs to be a prerequisite to the inlining work. I'd be willing to live with a switch to turn inlining on and off.
I haven't checked yet but If the new (1.5+) compiler has some temporary flags for control variables such as hairiness, depth, and inline on/off I'd be willing to spend some time working through the various hash functions I have as well as parts of the Go standard library to find some reasonable default settings and back them up with some data.

What can I do to help?

ianlancetaylor commented 9 years ago

Correct backtraces pretty much is a prerequisite, because things like the standard library's log package rely on being able to count the number of frames they get from runtime.Callers. These can't be casually broken. We might be able to use some different approach, but that would have to be invented, proposed, planned, and implemented. Or, we could implement inlining such that backtraces work.

tildeleb commented 9 years ago

IDK. I always thought that inlining and a stack frame were antithetical. From my perspective if function A is inlined into B then function A ceases to exist as a function. Function A becomes more like a textual hygienic macro that was expanded. Do you agree with that?

The whole idea of inlining is twofold:

Avoid a stack frame and sav/ret overhead which can be very substantial. Admittedly this is not quite the same issue as maintaining a "stack trace" for debugging and I gather from the above that some symbol table formats have support for inlined functions.
Provide further opportunity for an optimizer after inlining such as CSE elimination, shared registers, various kinds of lifting, peephole optimizations and so on.

I guess the question is what's the overhead for maintaining a backtrace without a stack frame or will an inlined function require a stack frame just for debugging? If the latter, I will be bummed out.

I have one idea that comes to mind. Suppose functions declared inside other functions could be inlined without a backtrace entry? Obviously for this to work closures would need to be eligible for inlining. Perhaps this would offer the right amount of flexibility and control without adding any kind of syntax adornments to the language.

Top level functions always get a valid stack trace but local functions don't have the guarantee. That would work for me. Comments?

SamWhited commented 8 years ago

I ran into this issue while trying to improve the speed of bcrypt. Running something like the following benchmarks (https://bitbucket.org/snippets/SamWhited/aKK6k):

package main

import (
    "sync"
    "testing"

    "golang.org/x/crypto/bcrypt"
)

var pass = []byte(`superfly`)

const cost = 11

func BenchmarkBcrypt(b *testing.B) {
    for n := 0; n < b.N; n++ {
        crypt, _ := bcrypt.GenerateFromPassword(pass, cost)
        _ = crypt
    }
}

// One should probably generally limit CPU bound work to a handful of goroutines/threads,
// but for the purpose of example, do it the naive way and just emulate someone who has
// run their bcrypts directly in the requests without gating the max number that can run at once…
func runParallelBcrypts(num int) {
    var wg sync.WaitGroup
    wg.Add(num)
    for i := 0; i < num; i++ {
        go func() {
            crypt, _ := bcrypt.GenerateFromPassword(pass, cost)
            _ = crypt
            wg.Done()
        }()
    }
    wg.Wait()
}

func BenchmarkBcrypt10Parallel(b *testing.B) {
    for n := 0; n < b.N; n++ {
        runParallelBcrypts(10)
    }
}

func BenchmarkBcrypt100Parallel(b *testing.B) {
    for n := 0; n < b.N; n++ {
        runParallelBcrypts(100)
    }
}

resulted in a small, but noticeable (especially at higher levels of concurrency) improvement in performing lots of bcrypts (the improvement mostly appears to come from the blowfish ExpandKey function, which isn't inlined by default and makes a lot of function calls in a loop) when forcing the function to be inlined.

Numbers and graph:

https://docs.google.com/spreadsheets/d/1aUpR62qS7CJkPEO2CzNwY4R0MHFxM1WbdBJZAeRmYK8/edit?usp=sharing

the only change between benchmarks here was setting maxBudget to 1000 (which is enough to inline the ExpandKey function).

EDIT: Slightly different format for the report I put together that is self contained and doesn't fall down hard on low bandwidth connections: http://bl.ocks.org/SamWhited/ebe4f5923526c0d9220f1b5b23b56eba

gopherbot commented 6 years ago

Change https://golang.org/cl/72490 mentions this issue: cmd/compile: inline closures with captures

dgryski commented 6 years ago

rot is turned into a single instruction when the rotation value is constant: https://github.com/golang/go/issues/18254

randall77 commented 6 years ago

We also now do non-constant rotates. You need a &31 or &63 in the right place, so in practice you should use math/bits.RotateLeftXX.

golang / go

cmd/compile: enhancements to the gc inliner and optimizer #9337