Open · davidbolvansky opened this issue 5 years ago
Can you explain why umin(1, undef) is 0? This is surprising to me: I didn't expect such behaviour from the intrinsic, and I haven't found anything about it in the LangRef.
Undef logic is complicated. This is one reason we're moving away from it in IR.
To make min/max behave as expected with known bits, we clamp to the min/max value of the value type. The code that implements it is here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/ConstantFolding.cpp#L2463
If we have umin(1, undef), we can't return undef because some other analysis could then assume that the resulting undef value was "2" for example, but that would create a logic conflict with the definition of "umin".
The LangRef doesn't state this explicitly anywhere AFAIK, but you might infer it from the "and/or" logic examples in: https://llvm.org/docs/LangRef.html#undefined-values
More info here: https://llvm.org/devmtg/2020-09/slides/Lee-UndefPoison.pdf
define i1 @src(i32 %x, i32 %y, i32 %z) {
%cmp1 = icmp ult i32 %x, %y
%cmp2 = icmp ult i32 %x, %z
%r = and i1 %cmp1, %cmp2
ret i1 %r
}
define i1 @tgt(i32 %x, i32 %y, i32 %z) {
%min1 = call i32 @llvm.umin.i32(i32 %y, i32 %z)
%r = icmp ult i32 %x, %min1
ret i1 %r
}
As I understand it, for an input like X = 2, Y = 1, Z = undef the output will be:
src(2, 1, undef) == false
tgt(2, 1, undef) == undef
Is it right? If so, then Alive seems to be wrong :(
The result for the (2, 1, undef) input should be false for both:
umin(1, undef) --> 0
icmp ult 2, 0 --> false
So I don't see a problem with Alive on this example.
Another thing: Alive says that the transformation is correct (https://alive2.llvm.org/ce/z/p6hmwV). I am not sure, but I suspect it may be wrong.
define i1 @src(i32 %x, i32 %y, i32 %z) {
%cmp1 = icmp ult i32 %x, %y
%cmp2 = icmp ult i32 %x, %z
%r = and i1 %cmp1, %cmp2
ret i1 %r
}
define i1 @tgt(i32 %x, i32 %y, i32 %z) {
%min1 = call i32 @llvm.umin.i32(i32 %y, i32 %z)
%r = icmp ult i32 %x, %min1
ret i1 %r
}
I mentioned one such case where min form provides a better codegen in the comment 2.
The suggestion to generate the same code for both sounds reasonable, but should we first ask why the patterns are not recognized as similar, and perhaps improve instruction selection instead?
I am not sure, but why would we need to canonicalize x < y && x < z to min?
From
%cmp1 = icmp ult i32 %x, %y
%cmp2 = icmp ult i32 %x, %z
%and1 = and i1 %cmp1, %cmp2
To
%min1 = call i32 @llvm.umin.i32(i32 %y, i32 %z)
%cmp1 = icmp ult i32 %x, %min1
Wouldn't the form with two icmps be better as a canonical form?
I agree with Nikita that the form with icmps is more analyzable, because we wouldn't need to invert min/max in other passes (e.g. in LICM, in case one of min's operands is loop-variant) or add special handling in passes like ConstraintElimination.
Wouldn't it be better to convert patterns like Sanjay's example below, and min itself, into the form with two icmps + and?
%cmp.i.i.i = icmp ult i32 %m, %n
%.sroa.speculated = select i1 %cmp.i.i.i, i32 %n, i32 %m
%cmp = icmp ult i32 %.sroa.speculated, %i
This is quite debatable: if the new max(..) is loop-invariant, this transformation will help loop optimizers too.
This goes both ways though: What if i > n is loop invariant and i > m isn't? Could have been unswitched beforehand, and no longer after the transform.
I understand that it can help in some cases, but my intuition is that it will do more harm than good. Of course, my intuition may well be wrong :)
If we want to canonicalize in this direction, it would be necessary to at least check how passes that do some kind of condition inference would deal with the new form, such as ValueTracking, LazyValueInfo, PredicateInfo, ConstraintElimination.
Not sure yet how ValueTracking/LazyValueInfo/PredicateInfo handle cases with min/max, but ConstraintElimination would be affected by this transformation; below are examples.
1) ConstraintElimination can detect that %f.1 is redundant because of dominating conditions in pre
declare void @use(i1)
declare i32 @llvm.umax.i32(i32,i32)
define void @test_without_max(i32 %x, i32 %y, i32 %z, i1 %c) {
entry:
br i1 %c, label %pre, label %exit
exit:
ret void
pre:
%x.1 = icmp ugt i32 %x, %y
%y.1 = icmp ugt i32 %x, %z
%and = and i1 %x.1, %y.1
br i1 %and, label %loop, label %exit
loop:
%f.1 = icmp ult i32 %x, %y
call void @use(i1 %f.1)
br i1 true, label %exit, label %loop
}
In the debug output:
Condition ! %f.1 = icmp ult i32 %x, %y implied by dominating constraints
C %y.1 = icmp ugt i32 %x, %z 0
C %x.1 = icmp ugt i32 %x, %y 0
2) ConstraintElimination cannot detect that %f.1 is redundant here
declare void @use(i1)
declare i32 @llvm.umax.i32(i32,i32)
define void @test_with_max(i32 %x, i32 %y, i32 %z, i1 %c) {
entry:
br i1 %c, label %pre, label %exit
exit:
ret void
pre:
%max1 = call i32 @llvm.umax.i32(i32 %y, i32 %z)
%cmp1 = icmp ugt i32 %x, %max1
br i1 %cmp1, label %loop, label %exit
loop:
%f.1 = icmp ult i32 %x, %y
call void @use(i1 %f.1)
br i1 true, label %exit, label %loop
}
However, Florian Hahn said that he has a downstream patch for the pass that could handle cases like the umax one in the examples above. So I guess at least 1 of the 4 passes will work without any problems :)
As a note for those who will check the passes that Nikita mentioned: I intend to look at PredicateInfo to figure out whether min/max harms that pass.
IMHO this transform shouldn't be done at the IR level at all. I think that (i > m && i > n) is more analyzable than i > max(m, n). This would make more sense to me at the DAG level, if it's profitable there.
I think that's a reasonable alternative, but that means we should be inverting the transform (away from the intrinsics and toward logic-of-icmps)?
Note that even without that, we don't canonicalize the C++ examples in the description to the same form:
define i32 @_Z3maxjjj(i32 %i, i32 %m, i32 %n) {
%cmp = icmp ugt i32 %i, %m
%cmp1 = icmp ugt i32 %i, %n
%0 = and i1 %cmp, %cmp1
%conv = zext i1 %0 to i32
ret i32 %conv
}
define i32 @_Z4max2jjj(i32 %i, i32 %m, i32 %n) {
%cmp.i.i.i = icmp ult i32 %m, %n
%.sroa.speculated = select i1 %cmp.i.i.i, i32 %n, i32 %m
%cmp = icmp ult i32 %.sroa.speculated, %i
%conv = zext i1 %cmp to i32
ret i32 %conv
}
GCC does this transformation at the GCC IR level.
Here's a patch that shows where we are lacking in min/max transforms with the intrinsics: https://reviews.llvm.org/D98152
My guess is that we could add the set of transforms requested in this bug report without seeing regressions.
But if we can fix up any of the missing folds noted in D98152 first, that would reduce risk.
As part of this issue, we could probably also implement transformations for similar patterns:
; 1) x > y || x > z -> x > min(y,z)
define i1 @min1(i32 %i, i32 %m, i32 %n) {
%cmp = icmp ugt i32 %i, %m
%cmp1 = icmp ugt i32 %i, %n
%r = or i1 %cmp, %cmp1
ret i1 %r
}
; 2) x < y || x < z -> x < max(y,z)
define i1 @max1(i32 %i, i32 %m, i32 %n) {
%cmp = icmp ult i32 %i, %m
%cmp1 = icmp ult i32 %i, %n
%r = or i1 %cmp, %cmp1
ret i1 %r
}
; 3) x >= y || x >= z -> x >= min(y,z)
define i1 @min2(i32 %i, i32 %m, i32 %n) {
%cmp = icmp uge i32 %i, %m
%cmp1 = icmp uge i32 %i, %n
%r = or i1 %cmp, %cmp1
ret i1 %r
}
; 4) x <= y || x <= z -> x <= max(y,z)
define i1 @max2(i32 %i, i32 %m, i32 %n) {
%cmp = icmp ule i32 %i, %m
%cmp1 = icmp ule i32 %i, %n
%r = or i1 %cmp, %cmp1
ret i1 %r
}
It looks like GCC can handle all of them https://c.godbolt.org/z/K8fz4v
A similar pattern is discussed here: http://0x80.pl/notesen/2021-03-11-any-word-is-zero.html
bool any_zero(unsigned x, unsigned y) {
return x == 0 || y == 0;
}
bool any_zero_min(unsigned x, unsigned y) {
return std::min(x, y) == 0;
}
In IR, the patterns to match will look like this (with predicate inverted/swapped variants):
define i1 @umax(i32 %i, i32 %m, i32 %n) {
%cmp = icmp ugt i32 %i, %m
%cmp1 = icmp ugt i32 %i, %n
%r = and i1 %cmp, %cmp1
ret i1 %r
}
https://alive2.llvm.org/ce/z/ccTKfG
To be safer (avoid potential regressions), we should solve bug 46897 before we start creating intrinsics in IR.
This should be easy now with min/max intrinsics.
int p(int n, int m, int *arr) {
int s =0;
for (int i = 0; i < n && i < m; ++i) {
s += arr[i];
}
return s;
}
int px(int n, int m, int *arr) {
int s =0;
n = std::min(n, m);
for (int i = 0; i < n; ++i) {
s += arr[i];
}
return s;
}
This pattern could help a lot here. GCC generates the same code (modulo label names) for both functions; Clang does not.
Our problem is:
((x < y) AND (x < z)) -> x < MIN(y, z)
With MIN, we increase the number of uses of Y/Z. A similar patch failed on this issue too: https://reviews.llvm.org/D63060.
Isn't this a good motivation for new smin/smax/umin/umax intrinsics?
I think we should implement this pattern anyway, even if we don't have any of those intrinsics. I think we would gain more than we would lose to missed optimizations on Y/Z.
Opinions?
Second pattern (quite common as loop condition)
((x < y) AND (x < z)) -> x < MIN(y, z)
Possibly fully or partially fixed with https://reviews.llvm.org/D143726
Extended Description
Clang -O3
GCC generates the same code for both functions.