cmd/compile: split up SSA rewrite passes

mundaym commented 6 years ago

For Go 1.12 I'd like to experiment with splitting up the rewrite rules for generic optimization, lowering and lowered optimization into more phases. This is a tracking issue for that experimentation. Ideas and feedback welcomed.

As we have added more and more optimizations the rules files have become increasingly large. This introduces a few problems. Firstly 'noopt' builds actually do a lot of optimizations, some of which are quite complex and may make programs much harder to debug. Secondly the ways that rules interact is becoming less clear. For example, adding a new set of rules might introduce 'dead ends' into the optimization passes, because, say, the author hasn't take into account special cases such as indexed versions of memory operations. Thirdly there are some rules which will only ever fire once. For example, the rules to explode small struct accesses('SSAable'). This is inefficient since they need to be re-checked every time the optimization pass is re-run (the optimization passes tend to run over and over again until a steady state is reached).

Here is a summary of the existing passes that use code generated from the rewrite rules (I've ignored the 32-bit passes for now):

opt: first round of generic optimizations [mandatory][iterative]
decompose builtins: split up compound operations, such as complex number operations, into individual operations [mandatory][single pass]
late opt: repeat the generic optimizations after CSE etc. have run [mandatory][iterative]
lower: generate and optimize machine specific operations [mandatory][iterative]

As a rough guide I see the phases looking something like this (will most likely change quite a lot):

Generic phases:

optimize initial: generic optimizations targeting compound types (store-load forwarding etc.) and constant propagation [optional][iterative]
decompose compound types: split up SSAable operations and generate Move and Zero operations etc. for non-SSAable operations [mandatory][single pass]
decompose builtins: (might merge into the previous pass) [mandatory][single pass]
optimize main: generic optimizations before the main optimization passes (CSE, BCE, etc.) [optional][iterative]
optimize final: repeat the generic optimizations after the main optimization passes [optional][iterative]

Architecture specific phases (not all architectures will need all of these):

lower: minimal rules needed to produce executable code [mandatory][iterative]
lowered optimize initial: optimizations that can be applied in one pass (not sure if this will be needed) [optional][single pass]
lowered optimize main: optimizations that need to be executed iteratively [optional][iterative]
lowered optimize final: low priority optimizations that can be applied in one pass (or a small number of passes) such as indexed memory accesses and merging loads into instructions [optional][single pass (maybe iterative)]

I'm hoping the benefits of being able to reduce the number of rules and perhaps more efficient I-cache usage will make up for the increased number of passes. I'll need to experiment to see.

Most likely some rules will need to be duplicated in multiple passes (particularly the generic optimization passes). This will probably involve splitting the rules into more files and then re-combining them in individual passes (for example, constant propagation rules could get their own file and then be called from both the initial and main generic optimization passes).

There are some TODOs along these lines in the compiler source, but I couldn't find any existing issues, so apologies if this is a dup of another issue.

randall77 commented 6 years ago

Thirdly there are some rules which will only ever fire once. For example, the rules to explode small struct accesses('SSAable'). This is inefficient since they need to be re-checked every time the optimization pass is re-run (the optimization passes tend to run over and over again until a steady state is reached).

This is true, but I wonder how much of a problem it actually is. Last time I checked (which was a while ago), all rewrite rules together were less than 10% of compile time. And splitting up rule sets increases the number of times each value gets looked at also, so splitting up isn't free either.

lower: minimal rules needed to produce executable code [mandatory][iterative]

I think we should be able to make these one pass. From the comment in AMD64.rules:

// ***************************
// Above: lowering rules
// Below: optimizations
// ***************************
// TODO: Should the optimizations be a separate pass?

Splitting lower and optimization would make the lowering one-pass and allow turning off the optimizations with -N.

lowered optimize main: optimizations that need to be executed iteratively [optional][iterative] lowered optimize final: low priority optimizations that can be applied in one pass (or a small number of passes) such as indexed memory accesses and merging loads into instructions [optional][single pass (maybe iterative)]

I think it might be tricky to separate the rules out into a fixed ordering. How do we split up the current rules? For a new rule, where do you decide to put it? How do you record the reason why? I don't want to get to a state where there are opt1.rules through opt6.rules, and there's a rule in opt4.rules which can't be moved to opt3.rules or opt5.rules, but no one knows why. There's a simplicity in putting all the rules in a single pile. This last split/distinction seems to apply mostly (only?) to the CISC architectures (x86, s390). I'm not sure it makes much sense for the other archs.

martisch commented 6 years ago

I think it would be worth looking at least at separating out a final one pass ruleset for some very low level transformations. For example prefer this instruction over another one for these range of parameters. e.g. golang.org/cl/115997, MOV 0 -> XOR, maybe LEA 3arg -> 2x LEA2arg, ... that way they will not interfere with more complex optimizations and may reduce complexity in the main ruleset.

dr2chase commented 6 years ago

I don't see anything like a plan to do this for 1.12, and I am deferring to 1.13.

gopherbot commented 5 years ago

Change https://golang.org/cl/201978 mentions this issue: cmd/compile/internal/ssa: split lowered optimization passes

golang / go

cmd/compile: split up SSA rewrite passes #27034