Open tildeleb opened 9 years ago
gc does recognize constant rotates. It's not done by the inliner, it's recognized in the walkrotate() function in cmd/gc/walk.c For unconstrained rotates, it does need to do a lot additional checks. Blindly using the rotate instruction might not what the user wants.
Fortunately, most modern crypto/hashing algorithms use constant rotates, so i don't think recognizing variable rotate matters that much.
I agree that inlining of function closure that doesn't escape will be very beneficial. I certainly used it a lot.
@minux Thanks for the info about the rotates, much appreciated. I think the comment in walk.c says what I was trying to say.
// TODO: Could allow s and 32-s if s is bounded (maybe s&31 and 32-s&31).
Still I think it bears repeating that I saw it a comment somewhere that GCC recognizes this
func rot(x, k uint32) uint32 { return x << k | x >> (32 - k) }
as the idiom, without the &31
. I haven't verified that yet, but I will at some point.
I guess I can get what I want today for the ROT instruction if I am willing to flush the function call to rot
and expand the shifts by hand, but at some loss of clarity as to what is going on in the algorithm.
So a ^= rot(c, 4)
becomes a ^= c<<4 | c>>28
Could the compiler do the &31 or &63 for the shift/rotate amounts for you?
I expect the deficiencies in the current gc compiler will go away when they are rewritten into Go and another target-independent middle end is introduced.
One of my points was a lot could be gained from the current compiler by increasing hairiness
and level
and adding command line flags to set them. That is not a large request. We are talking about less than 10 lines of code.
I would like to see this added to 1.5. It makes sense (to me) to do it at the same time as the rewrite.
We want the rewrite to be purely mechanical. Let's revisit this after the rewrite is complete.
FYI, you can use -l=5 instead of repeating -l 5 times.
cc @randall77
How much do I have to pay to get just a switch set hairiness
? I'd settle for that in 1.5. It would affect the rewrite at all not any automated testing. You can leave hairiness
== 40.
In return I'd have a much easier time getting speedups in Go for some hash functions.
My kingdom for a switch.
Command-line flags are tricky. The problem is "go get". How do you tell it what flags you want? What if two packages you depend on want different flags? Incompatible flags? The ecosystem is much nicer if there is just one compiler for everything.
The current compiler does have some flags (like -l), but they are for debugging the compiler only, not for any production use.
We can certainly revisit the built-in settings for hariness and level.
As I said above I understand the desire to eschew all flags. On the other hand optimizers often have bugs and I have found the ability to have some control over them vital to my productivity over the years. e.g. I would have been blocked but I could proceed with the optimizer turned off.
You point about go get and packages it well taken and true.
My opinion is you will have a very hard time finding universal constants for hariness and level and people like me will be frustrated with your choice because you end up saying, "geez if hariness was just 1o more this would inline" and ditto for level. Or there is some onscure case that's broken for level n and the expediant fix is to decrement level by one until some other issue is fixed.
Here is a proposal.
Could source files contain a comment ala build tags that describes what kind of optimizations and other flags are needed? The benefits are probably obvious but I'll state them to be complete.
"Comment control" was just added for "go generate" so there is a precendent.
No, thanks. Let's not introduce more magic comments. The compiler should do the right thing by default.
//go:generate is not for a precedent for this, because the actual Go compiler doesn't recognize them. It's the Go command that processes them.
-N
should disable compiler optimisations.
linux-amd64-noopt
builder at http://build.golang.org/ to make sure the noopt mode continues to work. We've had it break a couple times in the past when nobody is paying attention. Now the builder forces us to pay attention.Turning off optimization is easy, -gcflags "-N -l" will do.
Extreme inlining will bloat the binary, we need to do more investigation to weigh the trade-offs. (And I have to say Go binaries are already pretty large. I'm willing to go so far as to do some compression to reduce the Go binary size. We definitely shouldn't just blindly increase the level and hairiness)
Also, just because a function could be inlined, that doesn't mean we should inline it. GCC has fairly complex heuristics regarding when to inline and when not to inline.
PS: any non-leaf function will likely remain to be not inlined, otherwise that will change the stack trace, and we value debuggability better than raw performance there.
I'll summarize the issues and requests below and sign off on this one. I apologize if this issue wandered around a bit. Everything here is related to the inliner and optimization.
Must upsetting to me is the perception that I am complaining. I think frustrated would be better word. With a little effort the Go inliner could be much more effective. It's low hanging fruit.
PS: any non-leaf function will likely remain to be not inlined, otherwise that will change the stack trace, and we value debuggability better than raw performance there.
That's work for debug info. A single PC can be associated with a whole stack of inlined functions, and DWARF supports this. That will improve current situation as well, because currently we miss frames in CPU profiles and during nil derefs in inlined functions.
On Sun, Jan 11, 2015 at 3:50 AM, Dmitry Vyukov notifications@github.com wrote:
PS: any non-leaf function will likely remain to be not inlined, otherwise that will change the stack trace, and we value debuggability better than raw performance there.
That's work for debug info. A single PC can be associated with a whole stack of inlined functions, and DWARF supports this. That will improve current situation as well, because currently we miss frames in CPU profiles and during nil derefs in inlined functions.
DWARF supports everything. But we certainly don't want to base our backtrace support on DWARF.
If we support showing backtraces for inlined functions, ideally I still want to see the argument values dump like we do now, but on the other hand, that's close to impossible to implement without invalidating all the benefit of inlining.
It's easy to do it our own traceback format. Yes, arguments of inlined functions will be missing.
Sorry I was gone for awhile. Thanks to @rsc for keeping this one open.
A few comments:
What can I do to help?
Correct backtraces pretty much is a prerequisite, because things like the standard library's log package rely on being able to count the number of frames they get from runtime.Callers. These can't be casually broken. We might be able to use some different approach, but that would have to be invented, proposed, planned, and implemented. Or, we could implement inlining such that backtraces work.
IDK. I always thought that inlining and a stack frame were antithetical. From my perspective if function A is inlined into B then function A ceases to exist as a function. Function A becomes more like a textual hygienic macro that was expanded. Do you agree with that?
The whole idea of inlining is twofold:
I guess the question is what's the overhead for maintaining a backtrace without a stack frame or will an inlined function require a stack frame just for debugging? If the latter, I will be bummed out.
I have one idea that comes to mind. Suppose functions declared inside other functions could be inlined without a backtrace entry? Obviously for this to work closures would need to be eligible for inlining. Perhaps this would offer the right amount of flexibility and control without adding any kind of syntax adornments to the language.
Top level functions always get a valid stack trace but local functions don't have the guarantee. That would work for me. Comments?
I ran into this issue while trying to improve the speed of bcrypt. Running something like the following benchmarks (https://bitbucket.org/snippets/SamWhited/aKK6k):
package main
import (
"sync"
"testing"
"golang.org/x/crypto/bcrypt"
)
var pass = []byte(`superfly`)
const cost = 11
func BenchmarkBcrypt(b *testing.B) {
for n := 0; n < b.N; n++ {
crypt, _ := bcrypt.GenerateFromPassword(pass, cost)
_ = crypt
}
}
// One should probably generally limit CPU bound work to a handful of goroutines/threads,
// but for the purpose of example, do it the naive way and just emulate someone who has
// run their bcrypts directly in the requests without gating the max number that can run at once…
func runParallelBcrypts(num int) {
var wg sync.WaitGroup
wg.Add(num)
for i := 0; i < num; i++ {
go func() {
crypt, _ := bcrypt.GenerateFromPassword(pass, cost)
_ = crypt
wg.Done()
}()
}
wg.Wait()
}
func BenchmarkBcrypt10Parallel(b *testing.B) {
for n := 0; n < b.N; n++ {
runParallelBcrypts(10)
}
}
func BenchmarkBcrypt100Parallel(b *testing.B) {
for n := 0; n < b.N; n++ {
runParallelBcrypts(100)
}
}
resulted in a small, but noticeable (especially at higher levels of concurrency) improvement in performing lots of bcrypts (the improvement mostly appears to come from the blowfish ExpandKey
function, which isn't inlined by default and makes a lot of function calls in a loop) when forcing the function to be inlined.
Numbers and graph:
https://docs.google.com/spreadsheets/d/1aUpR62qS7CJkPEO2CzNwY4R0MHFxM1WbdBJZAeRmYK8/edit?usp=sharing
the only change between benchmarks here was setting maxBudget to 1000 (which is enough to inline the ExpandKey
function).
EDIT: Slightly different format for the report I put together that is self contained and doesn't fall down hard on low bandwidth connections: http://bl.ocks.org/SamWhited/ebe4f5923526c0d9220f1b5b23b56eba
Change https://golang.org/cl/72490 mentions this issue: cmd/compile: inline closures with captures
rot
is turned into a single instruction when the rotation value is constant: https://github.com/golang/go/issues/18254
We also now do non-constant rotates. You need a &31 or &63 in the right place, so in practice you should use math/bits.RotateLeftXX
.
Background I recently tried optimizing some hash functions in Go to find out what is possible without resorting to using assembler. I knew gc had an "inliner" but I didn't have any experience with it. I spent some time reading the "inl.c" code.
Summary Go's inliner seems to work well and I got speed increase of 30% on some hash functions using it. However, the way it is currently configured and controlled effectively turns if off for most cases. Most users won't figure out how to set the
level
(-gcflags="-l -l -l -l -l -l -m") and even less will be willing to recompile the tool chain to change the hard codedhairiness
variable set to more than 40. The inliner would be more effective if it could handle closures and variadic functions. Finally, a simple idiom recognizer would allow a simple function such asrot()
to be turned into a single instruction. Taken together these changes could yield significant performance improvements to certain kinds of Go packages.Here are my impressions, suggestions, and requests:
A simple code example is here, it's just one one Jenkin's Hash functions: https://play.golang.org/p/rmDG7MR5F9 This comes from: https://github.com/tildeleb/hashland/blob/master/jenkins/jenkins.go The main project is: https://github.com/tildeleb/hashland/
Observations and Impressions
go build -gcflags="-l -l -l -l -l -l -m"
hairiness
. It is current set to 40 and that's not very hairy. In the example above. I had to breakmix()
andfinal()
into 3 functions of two lines each, to get them to inline. 40 really seems like a very small budget to me. Hard coding a parameter like this is just wrong.mix()
andfinal()
which close over the hash state variables a, b, and c need to moved out of the function "HashWords". Since the state variables a, b, and c are lo longer closed over they need to be passed to and returned from themix()
andfinal() functions
. For example:a, b, c = mix(a, b, c)
. If all hash functions had 3 state variables that might be OK.However, other more advanced hash functions have more state. For example in Jenkin's SpookyV2 I had to write:
h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11 = Mix(inul, h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11)
Which I hope we can all agree is out of control and is why I believe why closures are the preferred solution to many inlining based use cases.
The downside of the closures is that if you have multiple versions of a hash function that use the same mix() and final() code, those closures need to be duplicated inside the top level functions.
Again, I prefer the closures. The source code is cleaner, easier to read, and probably easier to optimize because the optimizer doesn't need to optimize out the calling and return stack push/pop sequences to get good code quality.
rot(x, k)
gets inlined but it doesn't generate a rot insturction. Other compilers are able to do this. Actually, I think a case can be made to add a rotate (no carry) operator (>>>
or<<<
) to the language for this particular case, but that's another "issue" as they say. BTW Go generates about 22 instructions to get the rotate done.Variadic functions aren't inlined. I am considering proposing the following as a new interface for hash functions that would be complementary to the current hash streaming interface.
Hash32(b []byte, seeds ...uint32) uint32
Hash64(b []byte, seeds ...uint64) uint64
Hash128(b []byte, seeds ...uint64) (uint64, uint64)
Hash(in []byte, out []byte, seeds ...uint64) []byte
Because of this I would like to see inlining of variadic functions.
FWIW I still have to benchmark this vs other alternative calling sequences.
Suggestions
hairiness
andlevel.
I know the Go team eschews command line flags but some control over the optimization process has been around since the unix epoch and is probably inevitable.For example:
% go build -ilbudget=120 illevel=6
rot
in the example above can be turned into single instructions, really (2-3 instructions counting theMOVQ
to the registers instead of the 20+ now generated to get a rot.