missing simplification for llvm.vector.reduce.and on vector of i1

ec04fc15-fa35-46f2-80e1-5d271f2ef708 commented 3 years ago


Bugzilla Link	51259
Version	trunk
OS	All
CC	@topperc,@LebedevRI,@RKSimon,@phoebewang,@rotateright

Extended Description

Live demo: https://godbolt.org/z/a6YaahdPP

Example:

[[gnu::weak]] void do_this() {}
[[gnu::weak]] void do_that() {}

void f1(unsigned char const p[8]) {
  if (p[0] != 0x00 & p[1] != 0x00 & p[2] != 0x00 & p[3] != 0x00 & p[4] != 0x00 &
      p[5] != 0x00 & p[6] != 0x00 & p[7] != 0x00) {
    do_this();
  } else {
    do_that();
  }
}

void f2(unsigned const char *p) {
  using T [[gnu::vector_size(8), gnu::aligned(1)]] = unsigned char;
  T same = *(T *)p == (T){0, 0, 0, 0, 0, 0, 0, 0};
  if ((unsigned long)same == 0) {
    do_this();
  } else {
    do_that();
  }
}

This results in the following:

f1(unsigned char const*):                               # @f1(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vpmovmskb       eax, xmm0
        not     eax
        cmp     al, -1
        jne     ...

f2(unsigned char const*):                               # @f2(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        test    rax, rax
        je      ...

I think these should produce the same assembly, and the result from f2 looks better to me (though both are the same size). Presumably we'd need to recognize that after vpcmpeqb, each lane in xmm0 is either all-zeros or all-ones, so the vpmovmskb is redundant.

LebedevRI commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#51305

LebedevRI commented 3 years ago

Alright, i think the only thing we are missing before we can close this is cost model changes.

This block: https://github.com/llvm/llvm-project/blob/01d59c0de822099c62f12f275c41338f6df9f5ac/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp#L1978-L2147 needs to be effectively copied here: https://github.com/llvm/llvm-project/blob/01d59c0de822099c62f12f275c41338f6df9f5ac/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L1533 with an appropriate set of tests.

iff we are asking for the cost of an zext/sext from <n x i1>, and all users are such integral reductions, then the cost of said extension should be 0, since it will be gone.

@Simon: might you be interested in handling with this? If not, i could finish this i guess.

LebedevRI commented 3 years ago

Awesome, thanks! Minor point: the original two examples still don't canonicalize to the same IR:

f1 uses

%5 = bitcast <8 x i1> %4 to i8 %6 = icmp eq i8 %5, 0

resulting in
    vpmovmskb       eax, xmm0
    test    al, al
f2 uses

%5 = sext <8 x i1> %4 to <8 x i8> %6 = bitcast <8 x i8> %5 to i64 %7 = icmp eq i64 %6, 0
    vmovq   rax, xmm0
    test    rax, rax
I have no idea which of these is better (if either), but presumably there's an instcombine missing or similar?

Yep, looks like we fail to drop sext before bitcast-icmp: https://godbolt.org/z/Mf8vMfs5r

ec04fc15-fa35-46f2-80e1-5d271f2ef708 commented 3 years ago

Awesome, thanks! Minor point: the original two examples still don't canonicalize to the same IR:

f1 uses

%5 = bitcast <8 x i1> %4 to i8 %6 = icmp eq i8 %5, 0

resulting in

    vpmovmskb       eax, xmm0
    test    al, al

f2 uses

%5 = sext <8 x i1> %4 to <8 x i8> %6 = bitcast <8 x i8> %5 to i64 %7 = icmp eq i64 %6, 0

    vmovq   rax, xmm0
    test    rax, rax

I have no idea which of these is better (if either), but presumably there's an instcombine missing or similar?

LebedevRI commented 3 years ago

Ok, i've added all the missing instcombine pieces, all integer reductions w/ (potentially extended i1 element type are now expanded.

We now probably want to adjust costmodel to reflect that, and i'm not sure if we want to do some similar backend handling?

LebedevRI commented 3 years ago

Should we always be canonicalizing and/or/xor reductions to bitcasts+icmp/parity do you think?

IIRC we already do this in a couple of cases already.

Surely.

https://godbolt.org/z/fhrPzvxzv So we are missing xor/mul/comparison handling. i1 xor is just add: https://alive2.llvm.org/ce/z/GZToKH i1 mul is just and: https://alive2.llvm.org/ce/z/R4Apqp i1 umax is just or: https://alive2.llvm.org/ce/z/dSE3x_ i1 smin is just or: https://alive2.llvm.org/ce/z/49Z8yh i1 umin is just and: https://alive2.llvm.org/ce/z/SfZ2pU i1 smax is just and: https://alive2.llvm.org/ce/z/ndJ2Ze

rotateright commented 3 years ago

Should we always be canonicalizing and/or/xor reductions to bitcasts+icmp/parity do you think?

IIRC we already do this in a couple of cases already.

Already done with: https://reviews.llvm.org/D97406

The godbolt link in the description is testing with 12.0.1. Here's trunk: https://godbolt.org/z/6vE6T81Yn

There's still a difference in the IR and asm though. We might want to change both.

RKSimon commented 3 years ago

Should we always be canonicalizing and/or/xor reductions to bitcasts+icmp/parity do you think?

IIRC we already do this in a couple of cases already.

ec04fc15-fa35-46f2-80e1-5d271f2ef708 commented 3 years ago

It looks like we might be missing a simplification of @llvm.vector.reduce.and on a vector of i1: replacing

%5 = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> %4)

with an icmp looks like it should result in the better IR. (Thanks to Nick Lewycky for pointing this out.)

ec04fc15-fa35-46f2-80e1-5d271f2ef708 commented 3 years ago

assigned to @LebedevRI

llvm / llvm-project

missing simplification for llvm.vector.reduce.and on vector of i1 #50603

Extended Description