[SCEV][IndVarSimplification] Run time regression in Embench/primecount benchmark

martong commented 10 months ago

There is a run time regression in Embench's primecount benchmark since LLVM release 15.0. I have found with git bisect that the offending change is https://reviews.llvm.org/D129710 .

I have made the measurements with -target riscv32 -march=rv32imc -mabi=ilp32 and I've been using an instruction accurate simulator which handles all instruction as taking exactly one cycle. The number of cycles is ~50% more in LLVM 15.0 than in previous releases.

Primecount counts the number of primes up to a certain constant, it contains a loop nest of depth 3 and there are gotos to break out from the middle loop.

I could reduce the initial code (linked above) to this simplest case where there is still some meaningful difference in the run time:

int a[2];
int b() {
  int c[2];
  int d = 0;
  a[0] = c[0] = ++d;
  while (1) {
    for (int e = 0; e < d; ++e) {
      if (a[e] > 2)
        goto f;
      while (c[e] < 3)
        c[e] += a[e];
    }
    break;
  f:
    ++d;
  }
  return 3;
}

which is equivalent to the below code if we get rid of the goto:

int a[2];
int b() {
  int c[2];
  int d = 0;
  int exit_outer_loop = 0;
  a[0] = c[0] = ++d;
  while (!exit_outer_loop) {
    for (int e = 0; e < d; ++e) {
      if (a[e] > 2) {
        ++d;
        exit_outer_loop = 0; // ensures the outer loop runs again
        break; // exits the inner loop
      }
      while (c[e] < 3)
        c[e] += a[e];
      exit_outer_loop = 1; // ensures the outer loop terminates if inner loop completes
    }
  }
  return 3;
}

What happens here is that the offending commit enables an optimization of the induction variable c[e] in the innermost loop. During rewriteLoopExitValues we move the calculation of the final exit value of the IV into the loop's preheader.

; Preheader:
while.cond3.preheader:                            ; preds = %for.body
  %arrayidx4 = getelementptr inbounds [2 x i32], ptr %c, i32 0, i32 %e.011
  %arrayidx4.promoted = load i32, ptr %arrayidx4, align 4, !tbaa !4
  br label %while.cond3

...

*** IR Dump After IndVarSimplifyPass on while.cond3 ***

; Preheader:
while.cond3.preheader:                            ; preds = %for.body
  %arrayidx4 = getelementptr inbounds [2 x i32], ptr %c, i32 0, i32 %e.011
  %arrayidx4.promoted = load i32, ptr %arrayidx4, align 4, !tbaa !4
  %smax = call i32 @llvm.smax.i32(i32 %arrayidx4.promoted, i32 3)
  %1 = sub i32 %smax, %arrayidx4.promoted
  %umin = call i32 @llvm.umin.i32(i32 %1, i32 1)
  %2 = sub i32 %1, %umin
  %umax = call i32 @llvm.umax.i32(i32 %0, i32 1)
  %3 = udiv i32 %2, %umax
  %4 = add i32 %umin, %3
  %5 = mul i32 %0, %4
  br label %while.cond3

This was not happening before the offending commit because the SCEV expression that divides with 1 umax %0 was considered unsafe to expand (isSafeToExpand). Now, after the offending commit, in rewriteLoopExitValues we discover that this expansion has a high cost (isHighCostExpansion), but since the loop can be deleted we do the expansion anyway. So, the offending commit is logically correct, but it reveals an underlying weakness.

I was thinking about as a possible solution, to consider the product of the trip counts of the containing loops and compare that to the gain we have with the expansion. More formally, do the expansion only if I * J * X < I * J * X', where I and J are the trip count of the outermost and middle loops. X is the number of all operations before expansion in the innermost loop, X' is the same, but after the expansion; this needs to calculate with the trip count of the innermost loop. The problem is that SCEV is not able to infer the loop trip count (backedge taken count neither) for neither of these loops. This is perhaps because the induction variable of the innermost loop is dependent on the IV of an outer loop. Any pointers on SCEV's limitation could be helpful at this time. I was also experimenting to use Attributor's AAPotentialValues to see if the dataflow framework could discover the value of c[e], unfortunately it could not. (It reached the top.)

Any feed-back is more than welcome. I've assigned people with history in the related components, my intention was solely to draw their attention. Other than this, perhaps discussing the issue and possible fix would be more appropriate to be done in discourse and an RFC?

llvmbot commented 10 months ago

@llvm/issue-subscribers-backend-risc-v

Author: Gabor Marton (martong)

There is a run time regression in [Embench](https://github.com/embench/embench-iot)'s primecount benchmark since LLVM release 15.0. I have found with git bisect that the offending change is https://reviews.llvm.org/D129710 . I have made the measurements with `-target riscv32 -march=rv32imc -mabi=ilp32` and I've been using an instruction accurate simulator which handles all instruction as taking exactly one cycle. The number of cycles is ~50% more in LLVM 15.0 than in previous releases. [Primecount](https://github.com/embench/embench-iot/blob/4c5eb87983f51ca7fcf7855306877b3d1c3aabf1/src/primecount/primecount.c#L106) counts the number of primes up to a certain constant, it contains a _**loop nest**_ of depth 3 and there are `goto`s to break out from the middle loop. I could reduce the initial code (linked above) to this simplest case where there is still some meaningful difference in the run time: ``` int a[2]; int b() { int c[2]; int d = 0; a[0] = c[0] = ++d; while (1) { for (int e = 0; e < d; ++e) { if (a[e] > 2) goto f; while (c[e] < 3) c[e] += a[e]; } break; f: ++d; } return 3; } ``` which is equivalent to the below code if we get rid of the `goto`: ``` int a[2]; int b() { int c[2]; int d = 0; int exit_outer_loop = 0; a[0] = c[0] = ++d; while (!exit_outer_loop) { for (int e = 0; e < d; ++e) { if (a[e] > 2) { ++d; exit_outer_loop = 0; // ensures the outer loop runs again break; // exits the inner loop } while (c[e] < 3) c[e] += a[e]; exit_outer_loop = 1; // ensures the outer loop terminates if inner loop completes } } return 3; } ``` What happens here is that the offending commit enables an optimization of the induction variable `c[e]` in the innermost loop. During `rewriteLoopExitValues` we move the calculation of the final exit value of the IV into the loop's preheader. ``` ; Preheader: while.cond3.preheader: ; preds = %for.body %arrayidx4 = getelementptr inbounds [2 x i32], ptr %c, i32 0, i32 %e.011 %arrayidx4.promoted = load i32, ptr %arrayidx4, align 4, !tbaa !4 br label %while.cond3 ... *** IR Dump After IndVarSimplifyPass on while.cond3 *** ; Preheader: while.cond3.preheader: ; preds = %for.body %arrayidx4 = getelementptr inbounds [2 x i32], ptr %c, i32 0, i32 %e.011 %arrayidx4.promoted = load i32, ptr %arrayidx4, align 4, !tbaa !4 %smax = call i32 @llvm.smax.i32(i32 %arrayidx4.promoted, i32 3) %1 = sub i32 %smax, %arrayidx4.promoted %umin = call i32 @llvm.umin.i32(i32 %1, i32 1) %2 = sub i32 %1, %umin %umax = call i32 @llvm.umax.i32(i32 %0, i32 1) %3 = udiv i32 %2, %umax %4 = add i32 %umin, %3 %5 = mul i32 %0, %4 br label %while.cond3 ``` This was not happening before the offending commit because the SCEV expression that divides with `1 umax %0` was considered unsafe to expand (`isSafeToExpand`). Now, after the offending commit, in `rewriteLoopExitValues` we discover that this expansion has a high cost (`isHighCostExpansion`), but since the loop can be deleted we do the expansion anyway. So, the offending commit is logically correct, but it reveals an underlying weakness. I was thinking about as a possible solution, to consider the product of the trip counts of the containing loops and compare that to the gain we have with the expansion. More formally, do the expansion only if `I * J * X < I * J * X'`, where `I` and `J` are the trip count of the outermost and middle loops. `X` is the number of all operations before expansion in the innermost loop, `X'` is the same, but after the expansion; this needs to calculate with the trip count of the innermost loop. The problem is that SCEV is not able to infer the loop trip count (backedge taken count neither) for neither of these loops. This is perhaps because the induction variable of the innermost loop is dependent on the IV of an outer loop. Any pointers on SCEV's limitation could be helpful at this time. I was also experimenting to use `Attributor`'s `AAPotentialValues` to see if the dataflow framework could discover the value of `c[e]`, unfortunately it could not. (It reached the top.) Any feed-back is more than welcome. I've assigned people with history in the related components, my intention was solely to draw their attention. Other than this, perhaps discussing the issue and possible fix would be more appropriate to be done in discourse and an RFC?

efriedma-quic commented 10 months ago

Now, after the offending commit, in rewriteLoopExitValues we discover that this expansion has a high cost (isHighCostExpansion), but since the loop can be deleted we do the expansion anyway. So, the offending commit is logically correct, but it reveals an underlying weakness.

I'm having a little trouble following this. Is the "weakness" here just that deleting a loop might not be profitable if it means rewriting expensive values? Or is there something else going on here?

martong commented 10 months ago

Now, after the offending commit, in rewriteLoopExitValues we discover that this expansion has a high cost (isHighCostExpansion), but since the loop can be deleted we do the expansion anyway. So, the offending commit is logically correct, but it reveals an underlying weakness.

I'm having a little trouble following this. Is the "weakness" here just that deleting a loop might not be profitable if it means rewriting expensive values? Or is there something else going on here?

No, it is something else. I am trying to explain with other words what I mean under "weakness": Normally, it is a good heuristic to delete a loop even at a high cost of expansion. The problem is that we do delete the innermost loop, but the overall number of instructions in the whole loop nest increases.

So, indeed it is not profitable to delete the innermost loop in this case. Regular loop optimizations work on loops in isolation and they start with the innermost loop (and then go outward). If the pass took into consideration the whole loop nest, then we could have a better optimized code. And here comes my idea to consider the trip counts of each loop in the nest and the cost of the expansion. But, then there is the problem that the CFG seems to be too complex for SCEV to be able to infer the trip counts.

martong commented 10 months ago

@javedabsar @javedabsar1 Javed, I've found your presentation about SCEV, you might find this interesting

llvm / llvm-project

[SCEV][IndVarSimplification] Run time regression in Embench/primecount benchmark #71424