llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.33k stars 11.7k forks source link

Missing simplification (x > y) AND (x > z) to x > MAX (y,z)) #41552

Open davidbolvansky opened 5 years ago

davidbolvansky commented 5 years ago
Bugzilla Link 42207
Version trunk
OS Linux
CC @efriedma-quic,@fhahn,@LebedevRI,@RKSimon,@nikic,@rotateright,@vdsered

Extended Description

unsigned int max (unsigned int i, unsigned int m, unsigned int n) {
    return i > m && i > n;
}

unsigned int max2 (unsigned int i, unsigned int m, unsigned int n) {
    return i > std::max (m,n);
}

Clang -O3

max(unsigned int, unsigned int, unsigned int):
  cmp edi, esi
  seta al
  cmp edi, edx
  seta cl
  and cl, al
  movzx eax, cl
  ret
max2(unsigned int, unsigned int, unsigned int):  int, unsigned int)
  cmp esi, edx
  cmovb esi, edx
  xor eax, eax
  cmp esi, edi
  setb al
  ret

GCC generates same code for both functions.

rotateright commented 3 years ago

Can you explain why umin(1, undef) is 0? It is a subtle fact for me. I didn't expect such behaviour for the intrinsic and haven't found anything like that in LangRef

Undef logic is complicated. This is one reason we're moving away from it in IR.

To make min/max behave as expected with known bits, we clamp to the min/max value of the value type. The code that implements it is here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/ConstantFolding.cpp#L2463

If we have umin(1, undef), we can't return undef because some other analysis could then assume that the resulting undef value was "2" for example, but that would create a logic conflict with the definition of "umin".

The LangRef doesn't state this explicitly anywhere AFAIK, but you might infer it from the "and/or" logic examples in: https://llvm.org/docs/LangRef.html#undefined-values

More info here: https://llvm.org/devmtg/2020-09/slides/Lee-UndefPoison.pdf

dseredkin commented 3 years ago
define i1 @src(i32 %x, i32 %y, i32 %z) {
  %cmp1 = icmp ult i32 %x, %y
  %cmp2 = icmp ult i32 %x, %z
  %r = and i1 %cmp1, %cmp2
  ret i1 %r
}

define i1 @tgt(i32 %x, i32 %y, i32 %z) {
  %min1 = call i32 @llvm.umin.i32(i32 %y, i32 %z)
  %r = icmp ult i32 %x, %min1
  ret i1 %r
}

As I understand for input like X = 2, Y = 1, Z = undef the output will be src(2, 1, undef) == false tgt(2, 1, undef) == undef

Is it right? If so, then Alive seems to be wrong :(

The result of (2,1,undef) input should be false for both: umin(1, undef) --> 0 icmp ult 2, 0 --> false

So I don't see a problem with Alive on this example.

Can you explain why umin(1, undef) is 0? It is a subtle fact for me. I didn't expect such behaviour for the intrinsic and haven't found anything like that in LangRef

rotateright commented 3 years ago
define i1 @src(i32 %x, i32 %y, i32 %z) {
  %cmp1 = icmp ult i32 %x, %y
  %cmp2 = icmp ult i32 %x, %z
  %r = and i1 %cmp1, %cmp2
  ret i1 %r
}

define i1 @tgt(i32 %x, i32 %y, i32 %z) {
  %min1 = call i32 @llvm.umin.i32(i32 %y, i32 %z)
  %r = icmp ult i32 %x, %min1
  ret i1 %r
}

As I understand for input like X = 2, Y = 1, Z = undef the output will be src(2, 1, undef) == false tgt(2, 1, undef) == undef

Is it right? If so, then Alive seems to be wrong :(

The result of (2,1,undef) input should be false for both: umin(1, undef) --> 0 icmp ult 2, 0 --> false

So I don't see a problem with Alive on this example.

dseredkin commented 3 years ago

Another thing is that Alive does says that the transformation is correct https://alive2.llvm.org/ce/z/p6hmwV. I am not sure, but I suspect that it can be wrong.

define i1 @src(i32 %x, i32 %y, i32 %z) {
  %cmp1 = icmp ult i32 %x, %y
  %cmp2 = icmp ult i32 %x, %z
  %r = and i1 %cmp1, %cmp2
  ret i1 %r
}

define i1 @tgt(i32 %x, i32 %y, i32 %z) {
  %min1 = call i32 @llvm.umin.i32(i32 %y, i32 %z)
  %r = icmp ult i32 %x, %min1
  ret i1 %r
}

As I understand for input like X = 2, Y = 1, Z = undef the output will be src(2, 1, undef) == false tgt(2, 1, undef) == undef

Is it right? If so, then Alive seems to be wrong :(

dseredkin commented 3 years ago

I mentioned one such case where min form provides a better codegen in the comment 2.

The suggestion to generate the same code for both sounds reasonable, but should we at first ask why the patterns are not recognized as similiar and probably improve instruction selection?

davidbolvansky commented 3 years ago

I mentioned one such case where min form provides a better codegen in the comment 2.

dseredkin commented 3 years ago

I am not sure, but why would we need to canonicalize x < y && x < z to min?

From

%cmp1 = icmp ult i32 %x, %y %cmp2 = icmp ult i32 %x, %z %and1 = and i1 %cmp1, %cmp2

To

%min1 = call i32 @​llvm.umin.i32(i32 %y, i32 %z) %cmp1 = icmp ult i32 %x, %min1

Wouldn't the form with two icmps be better as a canonical form?

I agree with Nikita that the form with icmps is more analyzable because we wouldn't need to invert min/max in other passes (I guess like LICM in case one of min's operands is loop-variant) or do special handling in passes like ConstraintElimination.

Wouldn't it be better to convert pattern like Sanjay's example below and min to the one with two icmps + and?

%cmp.i.i.i = icmp ult i32 %m, %n %.sroa.speculated = select i1 %cmp.i.i.i, i32 %n, i32 %m %cmp = icmp ult i32 %.sroa.speculated, %i

dseredkin commented 3 years ago

This is quite questionable because if new max(..) is loop invariant now, this transformation will help loop optimizers too.

This goes both ways though: What if i > n is loop invariant and i > m isn't? Could have been unswitched beforehand, and no longer after the transform.

I understand that it can help in some cases, but my intuition is that it will do more harm than good. Of course, my intuition may well be wrong :)

If we want to canonicalize in this direction, it would be necessary to at least check how passes that do some kind of condition inference would deal with the new form, such as ValueTracking, LazyValueInfo, PredicateInfo, ConstraintElimination.

Not sure yet how ValueTracking/LazyValueInfo/PredicateInfo handle cases with min/max, but ConstraintElimination would be affected by this transformation, below are examples

1) ConstraintElimination can detect that %f.1 is redundant because of dominating conditions in pre

declare void @use(i1)
declare i32 @llvm.umax.i32(i32,i32)
define void @test_without_max(i32 %x, i32 %y, i32 %z, i1 %c) {
entry:
  br i1 %c, label %pre, label %exit

exit:
  ret void

pre:
  %x.1 = icmp ugt i32 %x, %y
  %y.1 = icmp ugt i32 %x, %z
  %and = and i1 %x.1, %y.1
  br i1 %and, label %loop, label %exit

loop:
  %f.1 = icmp ult i32 %x, %y
  call void @&#8203;use(i1 %f.1)
  br i1 true, label %exit, label %loop
}

In the debug output:
Condition !  %f.1 = icmp ult i32 %x, %y implied by dominating constraints
   C   %y.1 = icmp ugt i32 %x, %z 0
   C   %x.1 = icmp ugt i32 %x, %y 0

2) ConstraintElimination cannot detect that %f.1 is redundant here

declare void @use(i1)
declare i32 @llvm.umax.i32(i32,i32)
define void @test_with_max(i32 %x, i32 %y, i32 %z, i1 %c) {
entry:
  br i1 %c, label %pre, label %exit

exit:
  ret void

pre:
  %max1 = call i32 @llvm.umax.i32(i32 %y, i32 %z)
  %cmp1 = icmp ugt i32 %x, %max1
  br i1 %cmp1, label %loop, label %exit

loop:
  %f.1 = icmp ult i32 %x, %y
  call void @use(i1 %f.1)
  br i1 true, label %exit, label %loop
}

However, Florian Hanh said that he has a patch downstream for the pass that could handle cases like with umax in the examples above. So I guess at least 1 of 4 passes will work without any problems :)

To notice those who will check the passes that Nikita mentioned, I intend to look at PredicateInfo to figure out if min/max do not harm this pass

nikic commented 3 years ago

This is quite questionable because if new max(..) is loop invariant now, this transformation will help loop optimizers too.

This goes both ways though: What if i > n is loop invariant and i > m isn't? Could have been unswitched beforehand, and no longer after the transform.

I understand that it can help in some cases, but my intuition is that it will do more harm than good. Of course, my intuition may well be wrong :)

If we want to canonicalize in this direction, it would be necessary to at least check how passes that do some kind of condition inference would deal with the new form, such as ValueTracking, LazyValueInfo, PredicateInfo, ConstraintElimination.

rotateright commented 3 years ago

IMHO this transform shouldn't be done at the IR level at all. I think that (i > m && i > n) is more analyzable than i > max(m, n). This would make more sense to me at the DAG level, if it's profitable there.

I think that's a reasonable alternative, but that means we should be inverting the transform (away from the intrinsics and toward logic-of-icmps)?

Note that even without that, we don't canonicalize the C++ examples in the description to the same form:

define i32 @​_Z3maxjjj(i32 %i, i32 %m, i32 %n) { %cmp = icmp ugt i32 %i, %m %cmp1 = icmp ugt i32 %i, %n %0 = and i1 %cmp, %cmp1 %conv = zext i1 %0 to i32 ret i32 %conv }

define i32 @​_Z4max2jjj(i32 %i, i32 %m, i32 %n) { %cmp.i.i.i = icmp ult i32 %m, %n %.sroa.speculated = select i1 %cmp.i.i.i, i32 %n, i32 %m %cmp = icmp ult i32 %.sroa.speculated, %i %conv = zext i1 %cmp to i32 ret i32 %conv }

davidbolvansky commented 3 years ago

This is quite questionable because if new max(..) is loop invariant now, this transformation will help loop optimizers too.

GCC does this transformation at the GCC IR level.

nikic commented 3 years ago

IMHO this transform shouldn't be done at the IR level at all. I think that (i > m && i > n) is more analyzable than i > max(m, n). This would make more sense to me at the DAG level, if it's profitable there.

rotateright commented 3 years ago

Here's a patch that shows where we are lacking in min/max transforms with the intrinsics: https://reviews.llvm.org/D98152

My guess is that we could add the set of transforms requested in this bug report without seeing regressions.

But if we can fix up any of the missing folds noted in D98152 first, that would reduce risk.

dseredkin commented 3 years ago

We could probably implement as a part of the issue transformations for similar patterns

; 1) x > y || x > z -> x > min(y,z)

define i1 @min1(i32 %i, i32 %m, i32 %n) {
  %cmp = icmp ugt i32 %i, %m
  %cmp1 = icmp ugt i32 %i, %n
  %r = or i1 %cmp, %cmp1
  ret i1 %r
}

; 2) x < y || x < z -> x < max(y,z)

define i1 @max1(i32 %i, i32 %m, i32 %n) {
  %cmp = icmp ult i32 %i, %m
  %cmp1 = icmp ult i32 %i, %n
  %r = or i1 %cmp, %cmp1
  ret i1 %r
}

; 3) x >= y || x >= z -> x >= min(y,z)

define i1 @min2(i32 %i, i32 %m, i32 %n) {
  %cmp = icmp uge i32 %i, %m
  %cmp1 = icmp uge i32 %i, %n
  %r = or i1 %cmp, %cmp1
  ret i1 %r
}

; 4) x <= y || x <= z -> x <= max(y,z)

define i1 @max2(i32 %i, i32 %m, i32 %n) {
  %cmp = icmp ule i32 %i, %m
  %cmp1 = icmp ule i32 %i, %n
  %r = or i1 %cmp, %cmp1
  ret i1 %r
}

It looks like GCC can handle all of them https://c.godbolt.org/z/K8fz4v

RKSimon commented 3 years ago

A similar pattern is discussed here: http://0x80.pl/notesen/2021-03-11-any-word-is-zero.html

bool any_zero(unsigned x, unsigned y) {
    return x == 0 || y == 0;
}

bool any_zero_min(unsigned x, unsigned y) {
    return std::min(x, y) == 0;
}

https://c.godbolt.org/z/osabnd

rotateright commented 4 years ago

In IR, the patterns to match will look like this (with predicate inverted/swapped variants):

define i1 @umax(i32 %i, i32 %m, i32 %n) {
  %cmp = icmp ugt i32 %i, %m
  %cmp1 = icmp ugt i32 %i, %n
  %r = and i1 %cmp, %cmp1
  ret i1 %r
}

https://alive2.llvm.org/ce/z/ccTKfG

To be safer (avoid potential regressions), we should solve bug 46897 before we start creating intrinsics in IR.

davidbolvansky commented 4 years ago

This should be easy now with min/max intrinsics.

davidbolvansky commented 5 years ago
int p(int n, int m, int *arr) {
    int s =0;
    for (int i = 0; i < n && i < m; ++i) {
        s += arr[i];
    }
    return s;
}

int px(int n, int m, int *arr) {
    int s =0;

    n = std::min(n, m);
    for (int i = 0; i < n; ++i) {
        s += arr[i];
    }
    return s;
}

This pattern could help here a lot. GCC generates same code (+- label names) for both functions, Clang does not.

Our problem is..

((x < y) AND (x < z) -> x < MIN (y,z)

In MIN, we increase thenumber of uses of Y/Z. Similar patch failed on this issue too https://reviews.llvm.org/D63060.

Isn't this a good motivation for new smin/smax/umin/umax intrinsics?

I think we should implement this pattern anyway, even if we don't have any those intrinsics. I think we would gain more, than we can lose on missed optimizations on "Y/Z"..

Opinions?

davidbolvansky commented 5 years ago

Second pattern (quite common as loop condition)

((x < y) AND (x < z) -> x < MIN (y,z)

davidbolvansky commented 1 year ago

Possibly fully or partially fixed with https://reviews.llvm.org/D143726