llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.26k stars 11.67k forks source link

[reassoc] "Reassociate expressions" pass optimizations not always profitable #62736

Closed pawosm-arm closed 8 months ago

pawosm-arm commented 1 year ago

Consider the following piece of code:

void innermost_loop(int i, double d1, double d2, double delta, int n, double cells[n])
{
  int j;
  const double d1d = d1 * delta;
  const double d2d = d2 * delta;

  for (j = 0; j <= i; j++)
    cells[j] = d1d * cells[j + 1] + d2d * cells[j];
}

When compiling at -Ofast level, after the "Reassociate expressions" pass, this code is transformed into an equivalent of:

  int j;

  for (j = 0; j <= i; j++)
    cells[j] = (d1 * cells[j + 1] + d2 * cells[j]) * delta;

Effectively, the computation of those loop invariants isn't done before the loop anymore, we have one extra multiplication on each loop iteration instead. The increased performance cost measured on AArch64 is significant (a test code built with the compiler having the "Reassociate expressions" pass manually disabled executes faster, namely with this pass enabled the same test code executes 30% slower); the problem is also worsen by lost optimization opportunities at codegen when targeting SVE.

The transformation happens in the ReassociatePass::OptimizeAdd() function and the only logic that could prevent it, is this:

  // If any factor occurred more than one time, we can pull it out.
  if (MaxOcc > 1) {
    LLVM_DEBUG(dbgs() << "\nFACTORING [" << MaxOcc << "]: " << *MaxOccVal
                      << '\n');

In this case, MaxOcc is 2 and MaxOccVal is delta.

I suspect this logic is too simplistic. I guess there should be some cost model involved where expansions of loop invariant computation expressions could have the cost high enough to prevent them from being transformed by this optimization. Alternatively, the LoopInfo analysis could be used to check whether the transformed expression is inside a loop and is invariant in that loop. If yes, such expression would not be modified.

Note that GCC 12 does not do such transformation at -Ofast hence not causing a performance drop. The performance delta was the primary reason for finding this problem. I've also found a 2013 paper [1] which states the following about this optimization profitability:

Figure 3 shows the performance distributions (density) generated by activating and deactivating the ’licm’ and ’reassociate’ compiler transformations for a GSM codec application. It can be observed that while the activation of ’licm’ has a clear positive effect on performance – the median is shifted towards higher performance values. This is not the case for the ’reassociate’ transformation, since the activation and deactivation distributions have almost the same shape and density, thus not permitting the designer to recognize a clear trend.

[1] http://amirashouri.ca/resources/vlsi-soc_Ashouri2013.pdf

nikic commented 1 year ago

Does https://reviews.llvm.org/D147457 fix it?

pawosm-arm commented 1 year ago

Ah, it's not merged, I'm using SHA d6bd4ea35437b1d39933e9526779e8c6e03125e0 and it's definitely not there. I'll give it a run.

pawosm-arm commented 1 year ago

Hi @nikic I did give that patch a go, sadly it didn't change anything, no matter if I set -reassociate-use-cse-local= explicitly or not. Which is not what I expected (briefly looking at that patch), in the end, the computation is in entry: basic block, while the use is in for.body:. After this optimization, the only thing that is left in entry: is an unconditional jump to for.cond.

qcolombet commented 1 year ago

https://reviews.llvm.org/D147457 is indeed not going to help, it solves a different problem. https://reviews.llvm.org/D147457 is about fixing the order of emission for what has been reassociated so that different reassociated expressions have higher chances of exposing CSE opportunities, it doesn't change the reassociate heuristic itself.

In this particular case, there is only one reassociation expression (what defines cells[j]), hence why https://reviews.llvm.org/D147457 does not apply.

Going back to what @pawosm-arm mentioned, the profitability metric for the reassociation is indeed too crude to capture this case. This model is only accurate for things that are within the same basic block or more generally have the same execution frequency. We would need to inject this information somewhere in the model. Unfortunately, I don't have much time to put here.

qcolombet commented 1 year ago

BTW, to make things a little easier for the next person looking at this, the IR reproducer is:

define dso_local void @innermost_loop(i32 noundef %arg, double noundef %arg1, double noundef %arg2, double noundef %arg3, i32 noundef %arg4, ptr nocapture noundef %arg5) local_unnamed_addr {
bb:
  %i = fmul fast double %arg1, %arg3
  %i6 = fmul fast double %arg2, %arg3
  br label %bb7

bb7:                                              ; preds = %bb10, %bb
  %i8 = phi i32 [ 0, %bb ], [ %i11, %bb10 ]
  %i9 = icmp sgt i32 %i8, %arg
  br i1 %i9, label %bb21, label %bb10

bb10:                                             ; preds = %bb7
  %i11 = add nuw nsw i32 %i8, 1
  %i12 = zext i32 %i11 to i64
  %i13 = getelementptr inbounds double, ptr %arg5, i64 %i12
  %i14 = load double, ptr %i13, align 8
  %i15 = fmul fast double %i, %i14
  %i16 = zext i32 %i8 to i64
  %i17 = getelementptr inbounds double, ptr %arg5, i64 %i16
  %i18 = load double, ptr %i17, align 8
  %i19 = fmul fast double %i6, %i18
  %i20 = fadd fast double %i15, %i19
  store double %i20, ptr %i17, align 8
  br label %bb7

bb21:                                             ; preds = %bb7
  ret void
}

Reproducer:

opt -passes=reassociate tmp.ll -S -o -
pawosm-arm commented 1 year ago

Is there a good reason why "Reassociate expressions" pass is executed before LICM? Maybe all that is needed is to ensure different execution order of those passes.

qcolombet commented 1 year ago

Is there a good reason why "Reassociate expressions" pass is executed before LICM? Maybe all that is needed is to ensure different execution order of those passes.

What would this solve?

pawosm-arm commented 1 year ago

Nah nothing, didn't have my morning coffee yet.

pawosm-arm commented 8 months ago

Commit has been merged.