Open ec04fc15-fa35-46f2-80e1-5d271f2ef708 opened 3 years ago
mentioned in issue llvm/llvm-bugzilla-archive#51305
Alright, i think the only thing we are missing before we can close this is cost model changes.
This block: https://github.com/llvm/llvm-project/blob/01d59c0de822099c62f12f275c41338f6df9f5ac/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp#L1978-L2147 needs to be effectively copied here: https://github.com/llvm/llvm-project/blob/01d59c0de822099c62f12f275c41338f6df9f5ac/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L1533 with an appropriate set of tests.
<n x i1>
,
and all users are such integral reductions, then the cost of said extension
should be 0, since it will be gone.@Simon: might you be interested in handling with this? If not, i could finish this i guess.
Awesome, thanks! Minor point: the original two examples still don't canonicalize to the same IR:
f1 uses
%5 = bitcast <8 x i1> %4 to i8 %6 = icmp eq i8 %5, 0
resulting in
vpmovmskb eax, xmm0 test al, al
f2 uses
%5 = sext <8 x i1> %4 to <8 x i8> %6 = bitcast <8 x i8> %5 to i64 %7 = icmp eq i64 %6, 0
vmovq rax, xmm0 test rax, rax
I have no idea which of these is better (if either), but presumably there's an instcombine missing or similar?
Yep, looks like we fail to drop sext before bitcast-icmp: https://godbolt.org/z/Mf8vMfs5r
Awesome, thanks! Minor point: the original two examples still don't canonicalize to the same IR:
f1 uses
%5 = bitcast <8 x i1> %4 to i8 %6 = icmp eq i8 %5, 0
resulting in
vpmovmskb eax, xmm0
test al, al
f2 uses
%5 = sext <8 x i1> %4 to <8 x i8> %6 = bitcast <8 x i8> %5 to i64 %7 = icmp eq i64 %6, 0
vmovq rax, xmm0
test rax, rax
I have no idea which of these is better (if either), but presumably there's an instcombine missing or similar?
Ok, i've added all the missing instcombine pieces, all integer reductions w/ (potentially extended i1 element type are now expanded.
We now probably want to adjust costmodel to reflect that, and i'm not sure if we want to do some similar backend handling?
Should we always be canonicalizing
and/or/xor reductions to bitcasts+icmp/parity do you think? IIRC we already do this in a couple of cases already.
Surely.
https://godbolt.org/z/fhrPzvxzv So we are missing xor/mul/comparison handling. i1 xor is just add: https://alive2.llvm.org/ce/z/GZToKH i1 mul is just and: https://alive2.llvm.org/ce/z/R4Apqp i1 umax is just or: https://alive2.llvm.org/ce/z/dSE3x_ i1 smin is just or: https://alive2.llvm.org/ce/z/49Z8yh i1 umin is just and: https://alive2.llvm.org/ce/z/SfZ2pU i1 smax is just and: https://alive2.llvm.org/ce/z/ndJ2Ze
Should we always be canonicalizing
and/or/xor reductions to bitcasts+icmp/parity do you think? IIRC we already do this in a couple of cases already.
Already done with: https://reviews.llvm.org/D97406
The godbolt link in the description is testing with 12.0.1. Here's trunk: https://godbolt.org/z/6vE6T81Yn
There's still a difference in the IR and asm though. We might want to change both.
Should we always be canonicalizing
IIRC we already do this in a couple of cases already.
It looks like we might be missing a simplification of @llvm.vector.reduce.and on a vector of i1: replacing
%5 = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> %4)
with an icmp looks like it should result in the better IR. (Thanks to Nick Lewycky for pointing this out.)
assigned to @LebedevRI
Extended Description
Live demo: https://godbolt.org/z/a6YaahdPP
Example:
This results in the following:
I think these should produce the same assembly, and the result from f2 looks better to me (though both are the same size). Presumably we'd need to recognize that after vpcmpeqb, each lane in xmm0 is either all-zeros or all-ones, so the vpmovmskb is redundant.