[InstCombine] canonicalize IR to funnel shift intrinsics #36765

Open rotateright opened 6 years ago

rotateright commented 6 years ago
Bugzilla Link: 37417
Version: trunk
OS: All
Depends On: llvm/llvm-project#46240
CC: @LebedevRI, @RKSimon

Extended Description

Forking this off from bug 37387.

LICM (I think) may split a rotate pattern in IR across blocks. Once that happens, DAGCombiner::MatchRotate() can't put the pieces back together again. CodeGenPrepare must reverse LICM to make the rotate visible to the DAG when the target has a legal/custom ROTL/ROTR node.

Here's a basic example (should verify that the unrolled case works too):

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
  for (unsigned i = 0; i < N; ++i)
    x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop invariant
}

$ ./clang -O2 ror.c -S -o - -fno-unroll-loops -emit-llvm

define void @rotateInLoop(i32* nocapture %x, i32 %b, i32* nocapture readonly %a, i32 %N) {
entry:
  %cmp12 = icmp eq i32 %N, 0
  br i1 %cmp12, label %for.cond.cleanup, label %for.body.lr.ph

for.body.lr.ph: 
  %sub = sub nsw i32 32, %b  <--- this should be moved back into the loop
  %wide.trip.count = zext i32 %N to i64
  br label %for.body

for.cond.cleanup:     
  ret void

for.body:
  %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !3
  %shr = lshr i32 %0, %b
  %shl = shl i32 %0, %sub
  %or = or i32 %shr, %shl
  %idxprom3 = zext i32 %or to i64
  %arrayidx4 = getelementptr inbounds i32, i32* %x, i64 %idxprom3
  %1 = trunc i64 %indvars.iv to i32
  store i32 %1, i32* %arrayidx4, align 4, !tbaa !3
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %for.cond.cleanup, label %for.body
}
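
For reference, here is a minimal sketch (the function name is made up here, not produced by clang) of the same rotate expressed with the funnel shift intrinsic; in this form there is no loop-invariant %sub for LICM to hoist, so there would be nothing for CodeGenPrepare to undo:

define i32 @rotr_by_b(i32 %v, i32 %b) {
  %r = call i32 @llvm.fshr.i32(i32 %v, i32 %v, i32 %b)  ; rotate right by %b (mod 32); defined even for %b == 0
  ret i32 %r
}
declare i32 @llvm.fshr.i32(i32, i32, i32)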
RKSimon commented 2 years ago

mentioned in issue llvm/llvm-project#46240

LebedevRI commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#42874

Trass3r commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#39624

rotateright commented 5 years ago

These are the canonicalizations we have so far:

  1. UB-safe shift+logic, matching types (sketched after this list) https://reviews.llvm.org/rL350672

  2. Various flavors of shift+logic with mismatched types https://reviews.llvm.org/rL350419

  3. Semi-UB-safe shift+logic+select, matching types https://reviews.llvm.org/rL350199

  4. Semi-UB-safe shift+logic with guard branch, matching types: https://reviews.llvm.org/rL349396
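
To make item 1 concrete, here is a rough sketch of the masked, UB-safe rotate pattern and the intrinsic it should become (the function names are invented here; they are not the committed test names):

define i32 @rotl_masked(i32 %x, i32 %s) {
  %neg = sub i32 0, %s
  %lamt = and i32 %s, 31       ; left amount:  s % 32
  %ramt = and i32 %neg, 31     ; right amount: (32 - s) % 32, never 32
  %shl = shl i32 %x, %lamt
  %shr = lshr i32 %x, %ramt
  %or = or i32 %shl, %shr      ; fully-defined rotate-left, even for s == 0
  ret i32 %or
}

; expected canonical form:
define i32 @rotl_masked_canonical(i32 %x, i32 %s) {
  %r = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 %s)
  ret i32 %r
}
declare i32 @llvm.fshl.i32(i32, i32, i32)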

rotateright commented 5 years ago

Copying the IR from the godbolt link here for reference (the expected intrinsic forms are sketched after the IR). I'm guessing there are a number of variations when the shift amount is not the same width as the other operands (similar to the simpler rotate patterns), so this is going to take a fair amount of pattern matching.

Any help in writing tests/patches is appreciated. I'm not sure when I'll get to these. The tricky part is that we can't canonicalize the simpler UB-unsafe sequences to the intrinsics, because the DAG expansion might result in worse performance. This is discussed here: https://reviews.llvm.org/D55604?id=177858#inline-492396 and in bug 34924.

define i32 @fshl_pow2_bitwidth_using_wide_type(i32 %a, i32 %b, i32 %c) {
  %widea = zext i32 %a to i64
  %wideb = zext i32 %b to i64
  %high = shl nuw i64 %widea, 32
  %wide = or i64 %high, %wideb
  %cmodwidth = and i32 %c, 31
  %cmodwidthwide = zext i32 %cmodwidth to i64
  %shifted = shl i64 %wide, %cmodwidthwide
  %skiplowhalf = lshr i64 %shifted, 32
  %trunc = trunc i64 %skiplowhalf to i32
  ret i32 %trunc
}

define i32 @fshr_pow2_bitwidth_using_wide_type(i32 %a, i32 %b, i32 %c) {
  %widea = zext i32 %a to i64
  %wideb = zext i32 %b to i64
  %high = shl nuw i64 %widea, 32
  %wide = or i64 %high, %wideb
  %cmodwidth = and i32 %c, 31
  %cmodwidthwide = zext i32 %cmodwidth to i64
  %shifted = lshr i64 %wide, %cmodwidthwide
  %trunc = trunc i64 %shifted to i32
  ret i32 %trunc
}
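
If I'm reading the wide-type patterns above correctly, the expected canonical forms are just the i32 intrinsics on the original operands (a sketch; the *_canonical names are made up here):

define i32 @fshl_pow2_bitwidth_using_wide_type_canonical(i32 %a, i32 %b, i32 %c) {
  %f = call i32 @llvm.fshl.i32(i32 %a, i32 %b, i32 %c)   ; high 32 bits of ((%a:%b) << (%c mod 32))
  ret i32 %f
}

define i32 @fshr_pow2_bitwidth_using_wide_type_canonical(i32 %a, i32 %b, i32 %c) {
  %f = call i32 @llvm.fshr.i32(i32 %a, i32 %b, i32 %c)   ; low 32 bits of ((%a:%b) >> (%c mod 32))
  ret i32 %f
}

declare i32 @llvm.fshl.i32(i32, i32, i32)
declare i32 @llvm.fshr.i32(i32, i32, i32)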
LebedevRI commented 5 years ago

Am I misreading the LangRef (and thus the IR is invalid), or is the naive funnel shift IR not being canonicalized? https://godbolt.org/z/W6WxrQ

IR looks good.

Ok, thank you for checking, that is reassuring.

We don't have that, these alternate DAG funnel expansions:
// fshl: (X << (Z % BW)) | (Y >> (BW - (Z % BW)))
// fshr: (X << (BW - (Z % BW))) | (Y >> (Z % BW))

...or even the simpler rotate patterns canonicalized in IR yet. I keep thinking we're close to done with this, but I've been perpetually too optimistic.

rotateright commented 5 years ago

Am I misreading the LangRef (and thus the IR is invalid), or is the naive funnel shift IR not being canonicalized? https://godbolt.org/z/W6WxrQ

IR looks good. We don't have that, these alternate DAG funnel expansions:
// fshl: (X << (Z % BW)) | (Y >> (BW - (Z % BW)))
// fshr: (X << (BW - (Z % BW))) | (Y >> (Z % BW))

...or even the simpler rotate patterns canonicalized in IR yet. I keep thinking we're close to done with this, but I've been perpetually too optimistic.
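
Written out as IR (with BW assumed to be 32 and the value names invented here), the first of those alternate expansions would look something like this; note the lshr operand that becomes poison when Z % BW == 0, whereas the intrinsic is fully defined for that case:

define i32 @fshl_alt_expansion(i32 %x, i32 %y, i32 %z) {
  %zmod = and i32 %z, 31        ; Z % BW, with BW = 32
  %sub = sub i32 32, %zmod      ; BW - (Z % BW); equals 32 when %zmod is 0
  %shl = shl i32 %x, %zmod
  %shr = lshr i32 %y, %sub      ; poison when %zmod is 0
  %or = or i32 %shl, %shr
  ret i32 %or
}

Replacing this with @llvm.fshl.i32(%x, %y, %z) would be a refinement (the intrinsic returns %x when %zmod is 0), so the open question is recognizing the pattern while being confident the backend expansion isn't a performance regression, per the discussion above.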

LebedevRI commented 5 years ago

Am I misreading the LangRef (and thus the IR is invalid), or is the naive funnel shift IR not being canonicalized? https://godbolt.org/z/W6WxrQ

rotateright commented 5 years ago

AFAIK, this is the last step that was needed before we add the transform to produce intrinsics in instcombine: https://reviews.llvm.org/rL346854

With this change, the pattern that includes a select should be reduced into the expected {sub/and/and/shl/lshr/or} form: https://reviews.llvm.org/rL346807

And we have better support for reducing the IR once it is in intrinsic form with: https://reviews.llvm.org/rL346814

rotateright commented 5 years ago

Bug 34924 shows a rotate variant that includes compare+branch, so we'll need to handle that as well as a 'select' variant of that pattern:

define i32 @rot_select(i32 %x, i32 %shamt){
  %cmp = icmp eq i32 %shamt, 0
  %shr = lshr i32 %x, %shamt
  %sub = sub i32 32, %shamt
  %shl = shl i32 %x, %sub
  %or = or i32 %shl, %shr
  %r = select i1 %cmp, i32 %x, i32 %or
  ret i32 %r
}

https://rise4fun.com/Alive/yMQ
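
For completeness, here is a sketch of the compare+branch form mentioned above (block and value names are invented here, not copied from bug 34924); for in-range shift amounts both this and the select form reduce to @llvm.fshr.i32(%x, %x, %shamt):

define i32 @rot_branch(i32 %x, i32 %shamt) {
entry:
  %cmp = icmp eq i32 %shamt, 0
  br i1 %cmp, label %end, label %rotate

rotate:
  %shr = lshr i32 %x, %shamt
  %sub = sub i32 32, %shamt
  %shl = shl i32 %x, %sub
  %or = or i32 %shl, %shr
  br label %end

end:
  %r = phi i32 [ %x, %entry ], [ %or, %rotate ]
  ret i32 %r
}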

rotateright commented 5 years ago

And how does this interact with vectorization?

We have to enhance the vectorizers to recognize the funnel shift intrinsics to avoid regressions: https://reviews.llvm.org/rL346661

Getting the cost models in shape: https://reviews.llvm.org/rL346683 https://reviews.llvm.org/rL346688

Converting narrow-able rotate patterns to minimal shift+logic ops: https://reviews.llvm.org/rL346713 https://reviews.llvm.org/rL346716

rotateright commented 5 years ago

And how does this interact with vectorization?

We have to enhance the vectorizers to recognize the funnel shift intrinsics to avoid regressions: https://reviews.llvm.org/rL346661

rotateright commented 5 years ago

Apologies for not updating rotate bugs with current status. We added more general funnel shift intrinsics to LLVM: https://reviews.llvm.org/D49242

These can be used to represent rotates in IR and will become the canonical form (hopefully in the not too distant future).

Given that we have these new intrinsics, I'm going to morph this bug into a request to canonicalize the IR shown here to the intrinsics. (There's no reason to pursue a backend hack in CGP to accomplish the same thing.)
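
For reference, the i32 flavor of the intrinsics (declarations as shown in the LangRef, semantics paraphrased in the comments):

declare i32 @llvm.fshl.i32(i32 %a, i32 %b, i32 %c)  ; high 32 bits of ((%a:%b) << (%c mod 32)); rotl(x, n) == fshl(x, x, n)
declare i32 @llvm.fshr.i32(i32 %a, i32 %b, i32 %c)  ; low 32 bits of ((%a:%b) >> (%c mod 32)); rotr(x, n) == fshr(x, x, n)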

Trass3r commented 5 years ago

Bug llvm/llvm-project#21124 has been marked as a duplicate of this bug.

Trass3r commented 5 years ago

Please also consider the UB-avoiding pattern: (x << (n % width)) | (x >> (-n % width))

And how does this interact with vectorization?

rotateright commented 6 years ago

If we add a rotate intrinsic: http://lists.llvm.org/pipermail/llvm-dev/2018-May/123292.html ...then we shouldn't need this chunk in CGP.