
[x86] llvm.experimental.reduce.{and, any} don't lower properly for boolean vectors #36050

Closed. 54aefcd4-c07d-4252-8441-723563c8826f closed this issue 2 years ago.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago
Bugzilla Link 36702
Resolution FIXED
Resolved on May 03, 2019 10:20
Version trunk
OS All
Blocks llvm/llvm-project#36435
CC @aemerson,@chandlerc,@topperc,@froydnj,@hfinkel,@hsivonen,@jdm,@RKSimon,@rotateright
Fixed by commit(s) r358286

Extended Description

I'd expect llvm.experimental.vector.reduce.and to be lowered to @llvm.x86.sse2.pmovmskb.128 on x86_64 (and to similar instructions on other architectures) when the input vectors are boolean vectors, that is, when each lane has either all bits set or all bits cleared. The same applies to llvm.experimental.vector.reduce.any.

For example (thanks to Chandler for coming up with this):

declare i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1>)
declare void @llvm.assume(i1)
declare void @a()
declare void @b()

define void @stdsimd_all_8x16(<16 x i8>* noalias nocapture dereferenceable(16) %ptr) unnamed_addr #0 {
  %v = load <16 x i8>, <16 x i8>* %ptr, align 16
  %t = trunc <16 x i8> %v to <16 x i1>
  %e1 = sext <16 x i1> %t to <16 x i8>
  %e2 = icmp eq <16 x i8> %v, %e1
  %e3 = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %e2)
  call void @llvm.assume(i1 %e3)
  %x = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %t)
  br i1 %x, label %a, label %b
a:
  call void @a()
  ret void
b:
  call void @b()
  ret void
}

should emit:

push rbp
mov rbp, rsp
movdqa xmm0, xmmword ptr [rdi]
pmovmskb eax, xmm0
cmp eax, 65535
sete al
pop rbp
ret

and replacing "reduce.and" with "reduce.any" should emit:

push rbp
mov rbp, rsp
movdqa xmm0, xmmword ptr [rdi]
pmovmskb eax, xmm0
test eax, eax
setne al
pop rbp
ret
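For reference, the "any" form corresponds to the or-reduction intrinsic. A sketch of the changed kernel lines, assuming the or-reduction follows the same naming scheme as the and-reduction above:

declare i1 @llvm.experimental.vector.reduce.or.i1.v16i1(<16 x i1>)

  ; "any lane true" is an or-reduction across the boolean vector
  %x = call i1 @llvm.experimental.vector.reduce.or.i1.v16i1(<16 x i1> %t)
  br i1 %x, label %a, label %b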

54aefcd4-c07d-4252-8441-723563c8826f commented 2 years ago

mentioned in issue llvm/llvm-project#36435

RKSimon commented 5 years ago

Resolving, as this bug was originally about x86 (which now behaves correctly).

I've split off separate bugs for PowerPC/ARM/AArch64.

RKSimon commented 5 years ago

X86 improvement: https://reviews.llvm.org/D60610

This landed at rL358286. I'd recommend splitting off the ARM/AArch64/PowerPC issues into their own bugs, as these are all target-specific.

RKSimon commented 5 years ago

X86 improvement: https://reviews.llvm.org/D60610

RKSimon commented 6 years ago

CC'ing Amara, as he may be able to advise on the ARM/AArch64 reduction codegen.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

On ppc64le with AltiVec and VSX (https://godbolt.org/g/pKsjVA) it generates:

stdsimd_all_8x16: # @stdsimd_all_8x16
.Lfunc_gep0:
  addis 2, 12, .TOC.-.Lfunc_gep0@ha
  addi 2, 2, .TOC.-.Lfunc_gep0@l
  mflr 0
  std 0, 16(1)
  stdu 1, -32(1)
  lvx 2, 0, 3
  xxswapd 35, 34
  xxland 0, 34, 35
  xxspltw 1, 0, 2
  xxland 34, 0, 1
  vsplth 19, 2, 6
  xxland 34, 34, 51
  vspltb 3, 2, 14
  xxland 13, 34, 35
  xxswapd 0, 13
  mfvsrd 12, 0
  clrldi 3, 12, 56
  andi. 3, 3, 1
  bc 4, 1, .LBB0_2
  bl a
  nop
  b .LBB0_3
.LBB0_2: # %b
  bl b
  nop
.LBB0_3: # %a
  addi 1, 1, 32
  ld 0, 16(1)
  mtlr 0
  blr
  .long 0
  .quad 0

but IIUC it should just need a single vcmpequb. (the record form) and then read the result from CR6, something like the sketch below.
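A hand-written sketch of that idea (illustrative only: the register choices, label names, and exact spelling of the CR6 branch are my assumptions, not compiler output):

stdsimd_all_8x16:
  lvx 2, 0, 3          # load the <16 x i8> boolean vector
  vspltisb 3, -1       # splat all-ones (a true lane) into v3
  vcmpequb. 2, 2, 3    # record form: CR6 summarizes the lane compares
  blt 6, .LBB0_a       # CR6 "all lanes compared equal" bit set => all true
  b .LBB0_b            # otherwise at least one lane was false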

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

@Simon, you can peek at our current solution here: https://github.com/rust-lang-nursery/stdsimd/pull/425

Things are particularly bad for ARM and AArch64, where scalar code is currently generated. For example:

AArch64

LLVM6:

all_8x8:
  ldr d0, [x0]
  umov w8, v0.b[0]
  umov w9, v0.b[1]
  tst w8, #0xff
  umov w10, v0.b[2]
  cset w8, ne
  tst w9, #0xff
  cset w9, ne
  tst w10, #0xff
  umov w10, v0.b[3]
  and w8, w8, w9
  cset w9, ne
  tst w10, #0xff
  umov w10, v0.b[4]
  and w8, w9, w8
  cset w9, ne
  tst w10, #0xff
  umov w10, v0.b[5]
  and w8, w9, w8
  cset w9, ne
  tst w10, #0xff
  umov w10, v0.b[6]
  and w8, w9, w8
  cset w9, ne
  tst w10, #0xff
  umov w10, v0.b[7]
  and w8, w9, w8
  cset w9, ne
  tst w10, #0xff
  and w8, w9, w8
  cset w9, ne
  and w0, w9, w8
  ret

any_8x8:
  ldr d0, [x0]
  umov w8, v0.b[0]
  umov w9, v0.b[1]
  orr w8, w8, w9
  umov w9, v0.b[2]
  orr w8, w8, w9
  umov w9, v0.b[3]
  orr w8, w8, w9
  umov w9, v0.b[4]
  orr w8, w8, w9
  umov w9, v0.b[5]
  orr w8, w8, w9
  umov w9, v0.b[6]
  orr w8, w8, w9
  umov w9, v0.b[7]
  orr w8, w8, w9
  tst w8, #0xff
  cset w0, ne
  ret

Manually generated (an all-true check reduces to "is the unsigned minimum lane nonzero" via uminv, and an any-true check to "is the unsigned maximum lane nonzero" via umaxv):

all_8x8:
  ldr d0, [x0]
  mov v0.d[1], v0.d[0]
  uminv b0, v0.16b
  fmov w8, s0
  tst w8, #0xff
  cset w0, ne
  ret

any_8x8:
  ldr d0, [x0]
  mov v0.d[1], v0.d[0]
  umaxv b0, v0.16b
  fmov w8, s0
  tst w8, #0xff
  cset w0, ne
  ret

ARMv7+NEON

LLVM6:

all_8x8:
  vmov.i8 d0, #0x1
  vldr d1, [r0]
  vtst.8 d0, d1, d0
  vext.8 d1, d0, d0, #4
  vand d0, d0, d1
  vext.8 d1, d0, d0, #2
  vand d0, d0, d1
  vdup.8 d1, d0[1]
  vand d0, d0, d1
  vmov.u8 r0, d0[0]
  and r0, r0, #1
  bx lr

any_8x8:
  vmov.i8 d0, #0x1
  vldr d1, [r0]
  vtst.8 d0, d1, d0
  vext.8 d1, d0, d0, #4
  vorr d0, d0, d1
  vext.8 d1, d0, d0, #2
  vorr d0, d0, d1
  vdup.8 d1, d0[1]
  vorr d0, d0, d1
  vmov.u8 r0, d0[0]
  and r0, r0, #1
  bx lr

Manually generated (pairwise vpmin.u8/vpmax.u8 operations fold the lanes down in a few steps):

all_8x8:
  vldr d0, [r0]
  vpmin.u8 d16, d0, d16
  vpmin.u8 d16, d16, d16
  vpmin.u8 d0, d16, d16
  vmov.u8 r0, d0[0]
  bx lr

any_8x8:
  vldr d0, [r0]
  vpmax.u8 d16, d0, d16
  vpmax.u8 d16, d16, d16
  vpmax.u8 d0, d16, d16
  vmov.u8 r0, d0[0]
  bx lr

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

Simon, thank you for looking into this.

What we currently do in the Rust front-end is:

However, our workarounds currently only work if the whole program, including the std library, is compiled with the appropriate features enabled, because in many cases we pass features directly to LLVM, and whether some feature is enabled in a given context is not something the Rust front-end actually knows or can know, in particular when inlining is required to discover it.

So our current solution is far from satisfactory.

RKSimon commented 6 years ago

I'm sorry to say it doesn't make use of the assume calls at all. There might be something we can do to insert AssertSext ops, but I doubt it'll be straightforward.

This does seem similar to [Bug #15391] and [Bug #31600], which both discuss adding signext/zeroext attributes to boolean vector types.

If you're mostly interested in making use of pmovmskb, then you can do that reasonably easily with something like the IR below; the codegen could still be improved a bit in places.

I use v16i8 reductions, mainly as v16i1 will probably legalize to them anyhow.

declare i8 @llvm.experimental.vector.reduce.or.i8.v16i8(<16 x i8>)
declare i8 @llvm.experimental.vector.reduce.and.i8.v16i8(<16 x i8>)
declare void @llvm.assume(i1)

declare void @a()
declare void @b()

define void @stdsimd_all_8x16(<16 x i8>* noalias nocapture dereferenceable(16) %ptr) unnamed_addr #0 {
  %v = load <16 x i8>, <16 x i8>* %ptr, align 16
  %t = trunc <16 x i8> %v to <16 x i1>
  %e1 = sext <16 x i1> %t to <16 x i8>
  %e2 = call i8 @llvm.experimental.vector.reduce.and.i8.v16i8(<16 x i8> %e1)
  %x = icmp ne i8 %e2, 0
  br i1 %x, label %a, label %b
a:
  call void @a()
  ret void
b:
  call void @b()
  ret void
}

.LCPI0_0:
  .zero 16,128
stdsimd_all_8x16: # @stdsimd_all_8x16
  pushq %rax
  vmovdqa (%rdi), %xmm0
  xorl %ecx, %ecx
  vpsllw $7, %xmm0, %xmm0
  vpand .LCPI0_0(%rip), %xmm0, %xmm0
  vpmovmskb %xmm0, %eax
  cmpl $65535, %eax # imm = 0xFFFF
  movl $-1, %eax
  cmovnel %ecx, %eax
  testb %al, %al
  je .LBB0_2
  callq a
  popq %rax
  retq
.LBB0_2: # %b
  callq b
  popq %rax
  retq

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

It would be great if there were a canonical way for front-ends to tell LLVM that a vector is a vector of booleans, that is, that the bits in each lane are either all set or all cleared. This is what the sequence:

%t = trunc <16 x i8> %v to <16 x i1>
%e1 = sext <16 x i1> %t to <16 x i8>
%e2 = icmp eq <16 x i8> %v, %e1
%e3 = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %e2)
call void @llvm.assume(i1 %e3)

is trying to do.

Whether a front-end should then insert something like this sequence before each operation on a boolean vector, I don't know; it might bloat the amount of LLVM IR generated.

RKSimon commented 6 years ago

I'll take a look at this. I added the any_of/all_of reduction support to x86 a while ago; it's probably just not liking how llvm.experimental.vector.reduce.and is expanded.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

assigned to @RKSimon