llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

CFG simplify sink performance regression on AArch64 in libcxx std::map #29800

Open bmrzycki opened 8 years ago

bmrzycki commented 8 years ago
Bugzilla Link 30452
Version trunk
OS Linux
CC @hiraditya,@sebpop

Extended Description

The following commit causes an extra load inside a loop.

commit 808cdf24ff6941f8a2179abecb5c7e80a758a04a
Author: James Molloy <james.molloy@arm.com>
Date:   Sun Sep 11 09:00:03 2016 +0000

    [SimplifyCFG] Be even more conservative in SinkThenElseCodeToEnd

    This should *actually* fix llvm/llvm-project#29592. This cranks up the workaround for llvm/llvm-project#29546 so that we never sink loads or stores of allocas.

    The idea is that these should be removed by SROA/Mem2Reg, and any movement of them may well confuse SROA or just cause unwanted code churn. It's not ideal that the midend should be crippled like this, but that unwanted churn can really cause significant regressions in important workloads (tsan).
$ cat foo.cpp
#include <map>

int test(unsigned *keys, std::map<int, int> &m_map)
{
  int i, last_index, sane=0;

  for (i=0, last_index = 0; i<100; i++)
    {
      auto it = m_map.find(keys[last_index++]);
      if (it != m_map.end())
        sane += it->second;
    }

  return sane;
}
$ clang++  foo.cpp -O3 -S -o out.s
--- good.s      2016-09-19 17:25:03.708062780 -0500
+++ bad.s       2016-09-19 17:25:26.584666253 -0500
@@ -6,7 +6,7 @@
 _Z4testPjRNSt3__13mapIiiNS0_4lessIiEENS0_9allocatorINS0_4pairIKiiEEEEEE: // @_Z4testPjRNSt3__13mapIiiNS0_4lessIiEENS0_9allocatorINS0_4pairIKiiEEEEEE
 // BB#0:                                // %entry
        ldr     x9, [x1, #8]!
-       cbz     x9, .LBB0_9
+       cbz     x9, .LBB0_11
 // BB#1:                                // %for.body.preheader
        mov      x10, xzr
        mov      w8, wzr
@@ -14,40 +14,46 @@
                                         // =>This Loop Header: Depth=1
                                         //     Child Loop BB0_3 Depth 2
        ldr     w12, [x0, x10, lsl #2]
+       add     x10, x10, #1            // =1
        mov      x11, x1
        mov      x13, x9
 .LBB0_3:                                // %while.body.i.i.i
                                         //   Parent Loop BB0_2 Depth=1
                                         // =>  This Inner Loop Header: Depth=2
        ldr     w14, [x13, #28]
-       add     x15, x13, #8            // =8
        cmp             w14, w12
-       csel    x11, x11, x13, lt
-       csel    x13, x15, x13, lt
+       b.ge    .LBB0_5
+// BB#4:                                // %if.else.i.i.i
+                                        //   in Loop: Header=BB0_3 Depth=2
+       ldr     x13, [x13, #&#8203;8]
+       cbnz    x13, .LBB0_3
+       b       .LBB0_6
+.LBB0_5:                                // %if.then.i.i.i
+                                        //   in Loop: Header=BB0_3 Depth=2
+       mov      x11, x13
        ldr             x13, [x13]
        cbnz    x13, .LBB0_3
llvmbot commented 7 years ago

> Sure, please open another bug and reply the bug# here.

I've filed at llvm/llvm-bugzilla-archive#32022

hiraditya commented 7 years ago

I'm sorry, there might be some error, probably because I used libstdc++ std::map.

fa778132-f3d5-4559-b45d-fa683057d467 commented 7 years ago

> Is it possible for you to explain a bit more how this is now no longer an issue? I took your code and reproduced myself and saw the same performance delta between the two versions with a recent LLVM.

James, I have also seen the degradation in performance still existing in current tip: I think that there was some error in the experiments that Aditya did.

jmolloy commented 7 years ago

Hi,

Is it possible for you to explain a bit more how this is now no longer an issue? I took your code and reproduced myself and saw the same performance delta between the two versions with a recent LLVM.

James

llvmbot commented 7 years ago

Just reverting the patch unfortunately won't do, as it was fixing another issue. Though, I also hope it can be fixed on the numba side. Thanks for pointing us in the right direction!

hiraditya commented 7 years ago

> Ok, so for the apples-vs-oranges story, I suppose I better open a separate report then. Even if it was there for longer than we thought, I guess it's worth to fix it.

Sure, please open another bug and reply with the bug# here. I would suggest modifying the patch in numba (if you care about performance), because it is going to take quite some time to get this fixed in the compiler and released.

llvmbot commented 7 years ago

Ok, so for the apples-vs-oranges story, I suppose I better open a separate report then. Even if it was there for longer than we thought, I guess it's worth to fix it.

fa778132-f3d5-4559-b45d-fa683057d467 commented 7 years ago

Do we know which patch solved the problem?

Also it would be useful to add the test-case to the test-suite in order to avoid further regressions.

hiraditya commented 7 years ago

Recent analysis of the code posted by Brian shows that the redundant load is not there anymore. Now the only diff with this patch is:

commit 808cdf24ff6941f8a2179abecb5c7e80a758a04a
Author: James Molloy <james.molloy@arm.com>
Date:   Sun Sep 11 09:00:03 2016 +0000

diff map-master.s map-808cdf~.s

// BB#7:                                // %_ZNSt8_Rb_treeIiSt4pairIKiiESt10_Select1stIS2_ESt4lessIiESaIS2_EE14_M_lower_boundEPSt13_Rb_tree_nodeIS2_ESB_RS1_.exit.i.i
        cmp     x13, x10
        b.ne    .LBB0_9
        b       .LBB0_11
.LBB0_8:                                //   in Loop: Header=BB0_2 Depth=1

This does not affect performance in the test we have. We can close the bug.

hiraditya commented 7 years ago

> Hi hiraditya!
>
> The original issue I stumbled upon happened when updating LLVM from 3.8 to 3.9, but I'm not actually using clang myself, but numba. See the initial report here:
>
> https://github.com/numba/numba/issues/2196
>
> sklam was investigating the issue further with the apples-vs-oranges code, but unfortunately he can't join the discussion here yet, until his account gets activated. I agree that apples already shows the performance issue for 3.6. So it may not actually illustrate the initial problem, but a separate one?

It seems that the patch (https://github.com/numba/numba/commit/e03a4170fdc59a87561394ccdfa0f4abfa7ec1ac) which canonicalizes the backedge was added in numba and caused the regression. Now it makes sense: looking at the IR of the apples function, it would create bad code because of the structure of the control flow graph. The patch is pessimizing the code, so I would suggest reverting it if that is possible. From my analysis, getting rid of the selects and enabling proper vectorization would require undoing that canonicalization in the compiler, i.e., splitting the back-edge into multiple back-edges (which is fairly complicated to do in the compiler).

llvmbot commented 7 years ago

Hi hiraditya!

The original issue I stumbled upon happened when updating LLVM from 3.8 to 3.9, but I'm not actually using clang myself, but numba. See the initial report here:

https://github.com/numba/numba/issues/2196

sklam was investigating the issue further with the apples-vs-oranges code, but unfortunately he can't join the discussion here yet, until his account gets activated. I agree that apples already shows the performance issue for 3.6. So it may not actually illustrate the initial problem, but a separate one?

hiraditya commented 7 years ago

Hi Michael, the issue you pointed to seems to have been there before the recent changes in CFG sinking. I tried clang-3.8 and got the same numbers, i.e. the apples function is much slower than orange on x86.

If you have a better clang compiler, please let me know.

llvmbot commented 7 years ago

About randomized data, I don't think it will make a big difference in this case, at least according to how I understand Intel branch prediction. With amin quickly becoming reasonably small, it will only rarely receive further updates. So if the branch predictor only looks back over the last ~20 iterations, it will see amin < a[i] as having evaluated to true for the whole relevant history, and keep predicting true. In the end even random data should have pretty decent predictability, not that different from the array in the C code. So it doesn't take extraordinary skills from the x86 branch predictor to explain seeing the same timing for random data.

About the other questions, I wish I could contribute more, but unfortunately I'm just the unlucky guy whose benchmarks suddenly took three times longer. :/

jmolloy commented 7 years ago

Hi Michael,

Sorry, I've looked more closely at all the resources you linked and I agree it sounds related to CFG sinking.

Copy-pasting the interesting bits for clarity:

; ModuleID = 'nanmin.ll'
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"

; Function Attrs: nounwind ssp uwtable
define double @apple(double* %arr, i32 %size) #0 {
  br label %1

;

;

%all_missing.1 = select i1 %7, i32 0, i32 %all_missing.0

%amin.1 = select i1 %7, double %6, double %amin.0

%8 = add nsw i32 %i.0, 1
br label %1

;

Also importantly:

"""With gcc-4.8, it will always produce the slower performance for both functions"""


I've looked at the output, and there seem to be a couple of issues: 1) the apples() function produces SSE code that really doesn't look ideal to me; 2) the select is almost perfectly predictable in the case of the array as set up by main.c.

I have modified main.c to randomize the array contents - that gives the same speed difference. Therefore either (2) is not the dominating problem or the x86's branch predictor is really good.

I think the most likely issue is the generated SSE code, which looks pretty terrible to me but I'm no expert. FWIW, on an AArch64 platform the two versions produce rather similar scores:

ra = 3738394.000000 | rb = 3738394.000000
apple 0.592838
orange 0.601838

Which pushes me even more towards "SSE issue" rather than the second version being intrinsically faster due to branch prediction, although that could well be the case.

If it is, I'm afraid we don't have the ability to model predictability well enough to make an informed decision, so we just have to guess. We already have some guesses, such as "IsPredictableSelectExpensive", which aims to destroy selects in CodeGenPrepare, but it appears this isn't firing in this case (perhaps because it's a floating point value?)

Anyway, those are my thoughts in a rather jumbled fashion. I do think the compiler midend is currently doing the best it can with the information it has; the backend could surely do better though.

James

jmolloy commented 7 years ago

Hi Michael,

Is there a reason you attached this report to this particular bug? It doesn't seem particular relevant, unless I missed something?

James

llvmbot commented 7 years ago

We stumbled across a problem in numba caused by LLVM, which I guess may be a duplicate of this one. It looks like in 3.9 a branch is added to some loops, and further optimization splits the loop into two separate loops, causing severe speed regressions.

Here is some C code demonstrating and analyzing the problem:

https://gist.github.com/sklam/11f11a410258ca191e6f263262a4ea65

Here sklam already described the problem in the mailing list:

https://groups.google.com/forum/#!topic/llvm-dev/7HnK9ehPzKc

This is the initial bug report at numba:

https://github.com/numba/numba/issues/2196

Please let me know if this is actually a duplicate, or if I should better file a separate bug report. As the speed impact is pretty severe, we'd greatly appreciate a fix for this issue. Also, contrary to what the title suggests, this may not be a problem specific to one architecture.

fa778132-f3d5-4559-b45d-fa683057d467 commented 8 years ago

It sounds like the initial patch improving SimplifyCFG sinking in r279460 brought performance up, and then the heuristic tweaks brought it back down.

We have seen large changes (above 10%) in the performance of some benchmarks going up with your changes to sinking, and also some going down, which tells us that 1) it is an important transform that has a lot of potential, and 2) it needs either better heuristics or improvements to other passes so that it only positively impacts performance. Thanks again James for your work on sinking!

bmrzycki commented 8 years ago

James, if you have a few points in the development (SVN revisions) you'd like me to test just ask.

bmrzycki commented 8 years ago

Hi James, I also ran a few other tests to hopefully help characterize this problem.

First I tested a commit before your r279460. This is from about a day before:

commit 8af48927de6a79e5f19ffb19e781f007235dc8c3
Author: Justin Bogner <mail@justinbogner.com>
Date:   Wed Aug 31 23:43:14 2016 +0000

Support: Avoid errors with LLVM_FALLTHROUGH in clang 3.6 and below in C mode

I do not see the regression here.

I ran with a (near) 3.8 release compiler from March 7 and it looks like this is worse than the regressed code.

I tested a 3.9-ish compiler:

commit 1cb1a4ccc176595b34fb3c040b1a1b32310d64bc
Author: Saleem Abdulrasool <compnerd@compnerd.org>
Date:   Sun Jul 17 22:54:42 2016 +0000

test: add missing triple to test

And it showed similar results to the 3.8 release.

I'll dig some more for you. This may be a case where the uplift we tested occurred during the active development of sinking and is a "false" regression.

jmolloy commented 8 years ago

Hi Brian,

Sorry if I wasn't clear, but I'm trying to work out if this is a regression since r279460 or not; would you mind testing with a compiler from before then?

The reason I ask is that if this is a regression from before r279460 then I have to take immediate action now, whereas if it's just from where I crippled SimplifyCFG I have another solution for that (a new pass, gvn-sink).

Cheers,

James

bmrzycki commented 8 years ago

Hi James, I just tested against a fairly recent tip from last night: https://reviews.llvm.org/rL281949

And I still see the performance regression. The diff is a bit more complex to read because the compiler is moving around the code slightly differently than my previous paste. However, we still see one extra load within the loop. Sebastian and I inspected the diff to verify.

$ diff -u good.s tip.s | grep ldr | grep ^+ | wc -l
7
$ diff -u good.s tip.s | grep ldr | grep -- ^- | wc -l
6

jmolloy commented 8 years ago

Hi Brian,

Can I please confirm - is this a regression over 3.9 performance? I gave SimplifyCFG some extra powers in r279460 and then had to disable some of them in this commit. Is this a regression versus pre-r279460?

Cheers,

James

bmrzycki commented 8 years ago

I should also mention that the libcxx included was also compiled with clang/clang++ created with commit 808cdf24ff6941f8a2179abecb5c7e80a758a04a.

bmrzycki commented 8 years ago

assigned to @jmolloy

llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Brian Rzycki (bmrzycki)

(Bot repost of the original issue description above, omitted.)