golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123k stars 17.54k forks source link

cmd/compile: lay out loop-free, likeliness-free control flow more compactly #20356

Open josharian opened 7 years ago

josharian commented 7 years ago
package p

func f() int {
    x := 0
    for i := 0; i < 10; i++ {
        odd := 0
        if i%2 == 0 {
            odd = 2 // not 1, otherwise this branch gets optimized away!
        }
        x += odd
        // Distract the layout pass with a bunch of loops.
        for j := 0; j < 10; j++ {
            for j := 0; j < 10; j++ {
                for j := 0; j < 10; j++ {
                    x++
                }
            }
        }
    }
    return x
}

In this code, we have even odds of the if i%2 == 0 branch being taken. But the code layout is pretty uneven. In this CFG, b3 is that branch, and b6 and b19 are the taken/not-taken blocks; both feed into b7. It seems like a good layout for this would be b3 b6 b19 b7 or b3 b19 b6 b7. But we put b19 at the very end of the function.

This is a simplified version of something that also happens in the fannkuch benchmark. See also #20355 and #18977.

For details on how to read this image, see #20355.

noncompactlayout

cc @randall77 @dr2chase @cherrymui

Marking 1.9Maybe because we removed the old backend's instruction re-ordering pass during 1.9; this may help prevent regressions from that.

josharian commented 7 years ago

Seems like the layout pass would benefit from being made generally loop-aware.

cherrymui commented 7 years ago

Does it improve performance by laying b6 and b19 together?

Does the old follow pass put them together? Maybe I should investigate this.

What tool did you use to generate the graphs? I have done this a few times with pen and paper. A tool seems very helpful.

gopherbot commented 7 years ago

CL https://golang.org/cl/43464 mentions this issue.

josharian commented 7 years ago

Does it improve performance by laying b6 and b19 together?

I think it will, in larger functions. There are trade-offs--number of jumps, fwd vs backward, code compactness, jump encoding, etc. I need to experiment, although there are enough other layout-related things in flight (CL 43293, issue #20355), that I'd like to wait a little bit. Just filing this so I don't forget and to gather input from experienced hands.

Does the old follow pass put them together? Maybe I should investigate this.

I don't know. I'd be curious to find out.

What tool did you use to generate the graphs? I have done this a few times with pen and paper. A tool seems very helpful.

I used CL 43464, which gopherbot has helpfully linked to. It has major problems, though. Maybe I'll email golang-dev to get opinions and solicit help on it...

dr2chase commented 7 years ago

@laboger had mentioned that our likeliness/layout was not all that they had hoped.

One possibility I had intended to try was to introduce at least one more level of (un)likeliness, for things like branches to panics where we are "really sure" about likeliness, versus all the cases where we're making less-educated guesses. As a general rule we expect p(loopbackedge) > p(return) > p(panic).

josharian commented 7 years ago

Note to self: Interesting test cases for this are moderate-complexity autogenerated SSA rule functions, like rewriteValuedec_OpStore_0.

josharian commented 7 years ago

One possibility I had intended to try was to introduce at least one more level of (un)likeliness, for things like branches to panics where we are "really sure" about likeliness, versus all the cases where we're making less-educated guesses. As a general rule we expect p(loopbackedge) > p(return) > p(panic).

See CL 43293 for branches to panics; could be improved, but it is a simple and fairly effective first cut. And see the first (somewhat awful) patchset of that CL for what adding new likeliness levels looks like.

And yes, I think spending some time in 1.10 on likeliness and code layout is worthwhile.

josharian commented 7 years ago

Another observation for 1.10 use. It appears that a lot of the non-compactness I observed here may simply be due to the fact that posdegree (and zerodegree) are effectively stacks instead of queues. Though that may become irrelevant if lay out code based primarily on loop nesting information.

gopherbot commented 7 years ago

CL https://golang.org/cl/43501 mentions this issue.

josharian commented 7 years ago

I think CL 43491 (and possibly @dr2chase's follow-up) is enough to address the 1.9 regression. Punting this to 1.10.

ysmolski commented 5 years ago

I am looking at this. I was able to generate CFG using tip and the @josharian tool for the program in the topic. It's somewhat different from what was posted 1+ year ago (not a surprise). I got this:

screen shot 2018-10-12 at 15 35 47

For the full picture see the attached ssa.html: ssa.html.zip

Now I wonder how can we optimize the layout? What is the desired order of blocks in this case?

EDIT: I have uploaded improved versions of CFGs pictures.

ysmolski commented 5 years ago

For anyone who will work on this. Current version of gc does not have branching in layout pass for this code:

if i%2 == 0 {
     odd = 2
}

Layout pre/post b3 has the following code:

b3: ← b2
v21 (+10) = ADDQconst <int> [-1069] v52 (odd[int])
v57 (+8) = SHRQconst <int> [63] v7
v60 (?) = MOVQconst <int> [0]
v36 (+8) = ADDQ <int> v57 v7
v31 (+8) = SARQconst <int> [1] v36
v50 (+8) = SHLQconst <int> [1] v31
v15 (8) = CMPQ <flags> v7 v50
v23 (13) = CMOVQEQ <int> v5 v21 v15 (odd[int])
v24 (+13) = ADDQ <int> v23 v52 (x[int])
Plain → b8 (8)

It uses conditional move and thus you cannot replicate the bug with the code in the topic!

ysmolski commented 5 years ago

Code that reproduces the problem:

package p

//go:noinline
func g() {
}

func f() int {
    x := 0
    for i := 0; i < 10; i++ {
        odd := 0
        if i%2 == 0 {
            odd = 2 // not 1, otherwise this branch gets optimized away!
            g()
        }
        x += odd
        // Distract the layout pass with a bunch of loops.
        for j := 0; j < 10; j++ {
            for j := 0; j < 10; j++ {
                for j := 0; j < 10; j++ {
                    x++
                }
            }
        }
    }
    return x
}
ysmolski commented 5 years ago

@dr2chase:

One possibility I had intended to try was to introduce at least one more level of (un)likeliness, for things like branches to panics where we are "really sure" about likeliness, versus all the cases where we're making less-educated guesses. As a general rule we expect p(loopbackedge) > p(return) > p(panic).

I agree that it would solve this problem generally, in the case of i%2 == 0 we are pretty safe to assume that it's 50%/50%, while the compiler estimates that as unlikely. I am not sure why it is not the BranchUknown for this condition? Looks like the likelyadjsut pass estimates this to unlikely.

I've read in some very popular book that compiler could profile the code first to estimate likeness of branches and then make the optimization. Laughs and idealistic approaches aside, we can try to introduce what David has suggested.

@josharian what do you think?

CAFxX commented 5 years ago

I've read in some very popular book that compiler could profile the code first to estimate likeness of branches and then make the optimization. Laughs and idealistic approaches aside, we can try to introduce what David has suggested.

While we're on the topic, it seems there is no github issue for PGO, that may well be the only good way to address issues such as this or the midstack inlining one. Is it deemed completely out of scope for gc?

josharian commented 5 years ago

I’ll file a PGO issue.