golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

cmd/compile: reorganize associative computation to allow superscalar execution #49331

Open josharian opened 3 years ago

josharian commented 3 years ago

The compiler currently compiles a+b+c+d as a+(b+(c+d)). It should use (a+b)+(c+d) instead, because the latter can be executed out of order.

More broadly, we should balance trees of associative computation.

Doing this in the compiler with a rule tailored for a single computation type in round 4 of md5block.go yielded a 15% throughput improvement. (See below for details.)

It's not obvious to me whether we can do this with carefully crafted rewrite rules or whether a dedicated pass would be better. But it looks like there may be significant performance wins available on tight mathematical code.

I don't plan to work on this further, but I really hope someone else picks it up.

cc @randall77 @martisch @FiloSottile @mmcloughlin @mdempsky


To reproduce the md5 results, disable the optimized assembly routines, and add this rewrite rule:

(Add32 <t> c:(Const32) (Add32 d:(Load _ _) (Add32 x:(Xor32 _ _) a:(Add32 _ _)))) => (Add32 (Add32 <t> c d) (Add32 <t> x a))

and disable this one to avoid an infinite loop:

(Add32 (Add32 i:(Const32 <t>) z) x) && (z.Op != OpConst32 && x.Op != OpConst32) => (Add32 i (Add32 <t> z x))

seebs commented 3 years ago

    a := float64(1<<53)
    b := float64(1)
    c := float64(1)
    d := float64(0)
    e := a + (b + (c + d))
    f := (a + b) + (c + d)

It's probably fine for integers: with the wrap-around semantics Go guarantees, integer addition is associative. For floating-point values, it isn't always.

josharian commented 3 years ago

@seebs the SSA backend has different ops for different data types exactly to avoid this kind of issue. (We have @dr2chase to thank for that.) Relatedly this kind of optimization is also not appropriate for adding pointers, as intermediate results may be invalid pointers.

mdempsky commented 3 years ago

Relatedly this kind of optimization is also not appropriate for adding pointers, as intermediate results may be invalid pointers.

Note that it is safe (AFAICT) to rewrite (ptr + x1) + x2 into ptr + (x1 + x2), just not the other way around. I'm skeptical this is useful in practice though.

gopherbot commented 1 year ago

Change https://go.dev/cl/493115 mentions this issue: cmd/compile: add reassociate pass to rebalance commutative operations

y1yang0 commented 1 year ago

Interesting. As far as I know, reassociation in other compilers mainly assists other optimizations (such as SCCP, GCSE, and loop optimizations). Using reassociation to "accelerate CPU superscalar execution" seems to be a novel framing; is there any related information or paper? Maybe this is also helpful for other compilers. Thanks.

Java applies reassociation only when a loop invariant participates in a computation inside a loop. GCC trunk does not reassociate a+b+c+d to (a+b)+(c+d); Clang trunk does.

CAFxX commented 1 year ago

is there any related extension information/paper?

@y1yang0 see section 9.5 (and in general the whole of chapter 9) of https://www.agner.org/optimize/optimizing_assembly.pdf

ryan-berger commented 1 year ago

@y1yang0 Yes, Clang outputs the optimization because of the Reassociate pass in LLVM.

This is an extremely simple analysis pass compared to LLVM's version. Some of the things the LLVM pass does are already taken care of by the rewrite rules, which I didn't know existed until after I submitted the PR, so I've simplified this even further and removed the constant sorting.

If this pass runs again later, after opt, it might expose constant-folding opportunities, but it is more important that it run before GCSE, since it can help group expressions together nicely. For now the goal is just to speed up out-of-order execution; I'll work on more of the opportunities this optimization opens up soon.

It looks like the reason Java does reassociation only inside loops is auto-vectorization. This sort of pass helps sort out all the dependencies and makes it easy to recognize four consecutive additions that could be turned into SIMD if a cost model decides it is worthwhile.

This pass seems to make analysis much easier in a lot of places, though, so I'll be combing through some other compilers and LLVM to see whether we can leverage it more in Go.

gopherbot commented 1 year ago

Change https://go.dev/cl/496095 mentions this issue: cmd/compile: add ilp pass to help balance commutative expressions aiding in instruction level parallelism