llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[AVX,SSE] inefficient code generated for vector compares due to sext to i32 moved across phi node #12102

Open llvmbot opened 12 years ago

llvmbot commented 12 years ago
Bugzilla Link 11730
Version trunk
OS All
Attachments "sext instructions immediately after vector compares" (good code); "sext instruction after phi node" (bad code)
Reporter LLVM Bugzilla Contributor
CC @lattner,@topperc,@efriedma-quic,@RKSimon,@rotateright

Extended Description

The attached two test cases demonstrate an interesting situation that leads to much worse code than usual for vector computation on SSE and AVX (at least).

As context, when doing vector comparisons on those targets, it's usually important to immediately sext the i1 vector result of the comparison to an i32 vector value. (That is what the hardware instructions actually return; the x86 code generator picks up on this pattern and then emits just the desired vector comparison instruction.)

In the attached test case, the originally generated code had the sexts right after the vector compares, but a later optimization pass noticed that both values feeding into a phi node had a sext to i32 after them, so it removed the two original sexts and added a single one after the phi node. As a result, the x86 code generator doesn't pick up the pattern and generates very inefficient code that first painfully performs a zext conversion and then later painfully converts this back to the originally desired sign-extended value.

The two attachments are the same except that one has the sexts placed right after the vector compares and the other has the single sext after the phi node. When run through llc -mattr=+avx, the second one produces long sequences like the following:

movzbl  %r15b, %edx
vpinsrw $0, %edx, %xmm0, %xmm0
movzbl  %r12b, %edx
vpinsrw $1, %edx, %xmm0, %xmm0
movzbl  %bl, %edx
vpinsrw $2, %edx, %xmm0, %xmm0
movzbl  %r9b, %edx
vpinsrw $3, %edx, %xmm0, %xmm0
movzbl  %r10b, %edx
vpinsrw $4, %edx, %xmm0, %xmm0
movzbl  %r14b, %edx
vpinsrw $5, %edx, %xmm0, %xmm0
movzbl  %r11b, %edx
vpinsrw $6, %edx, %xmm0, %xmm0
movzbl  %sil, %edx
vpinsrw $7, %edx, %xmm0, %xmm6
vpextrw $1, %xmm6, %edx
vmovd   %xmm6, %esi
vmovd   %esi, %xmm0
vpinsrd $1, %edx, %xmm0, %xmm0
vpextrw $2, %xmm6, %edx
vpinsrd $2, %edx, %xmm0, %xmm0
vpextrw $3, %xmm6, %edx
vpinsrd $3, %edx, %xmm0, %xmm0
vpslld  $31, %xmm0, %xmm0
vpsrad  $31, %xmm0, %xmm7
vpextrw $5, %xmm6, %edx
vpextrw $4, %xmm6, %esi
vmovd   %esi, %xmm0
vpinsrd $1, %edx, %xmm0, %xmm0
vpextrw $6, %xmm6, %edx
vpinsrd $2, %edx, %xmm0, %xmm0
vpextrw $7, %xmm6, %edx
vpinsrd $3, %edx, %xmm0, %xmm0
vpslld  $31, %xmm0, %xmm0
vpsrad  $31, %xmm0, %xmm0
vinsertf128 $1, %xmm0, %ymm7, %ymm0
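
For illustration only, here is a minimal IR sketch of the two shapes being described (the names, types, and control flow are assumptions; the attached reproducers are larger). In the first function each compare is sign-extended in its own block before the phi, the shape the report calls good; in the second, the sexts have been merged into a single one after a phi of <8 x i1>, mirroring the transformation the report describes (whether this minimal form reproduces the bad sequences above on current trunk is not verified here).

; shape described as good: sext immediately after each compare, before the phi
define <8 x i32> @sext_before_phi(<8 x i32> %a, <8 x i32> %b, <8 x i32> %c, <8 x i32> %d, i1 %which) {
entry:
  br i1 %which, label %left, label %right

left:
  %cmp.l = icmp slt <8 x i32> %a, %b
  %ext.l = sext <8 x i1> %cmp.l to <8 x i32>
  br label %merge

right:
  %cmp.r = icmp slt <8 x i32> %c, %d
  %ext.r = sext <8 x i1> %cmp.r to <8 x i32>
  br label %merge

merge:
  %m = phi <8 x i32> [ %ext.l, %left ], [ %ext.r, %right ]
  ret <8 x i32> %m
}

; shape described as bad: the two sexts replaced by a single sext after the phi
define <8 x i32> @sext_after_phi(<8 x i32> %a, <8 x i32> %b, <8 x i32> %c, <8 x i32> %d, i1 %which) {
entry:
  br i1 %which, label %left, label %right

left:
  %cmp.l = icmp slt <8 x i32> %a, %b
  br label %merge

right:
  %cmp.r = icmp slt <8 x i32> %c, %d
  br label %merge

merge:
  %m = phi <8 x i1> [ %cmp.l, %left ], [ %cmp.r, %right ]
  %ext = sext <8 x i1> %m to <8 x i32>
  ret <8 x i32> %ext
}
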
nikic commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#40155

topperc commented 5 years ago

Bad test case updated to work on trunk

llvmbot commented 11 years ago

A related problem is when you do something like this:

pre:
  %mask_pre = fcmp olt <4 x float> %a, %b

head:
  %phi = phi <4 x i1> [ %mask_pre, %pre ], [ %mask_body, %body ]
  ...
  br ..., label %body, label %out

body:
  %sel = select <4 x i1> %phi, ..., ...
  ...
  %mask_body = ...

out:
  ...

Now, LLVM will emit code like this for the phi node:

pslld   $31, %xmm9
psrad   $31, %xmm9

This is completely superfluous.
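
For reference, here is a self-contained variant of that sketch (the function name, the trip-count handling, and the placeholder operations are assumptions, not code from the report) that can be fed to llc to inspect how the <4 x i1> phi is materialized:

define <4 x float> @mask_phi_loop(<4 x float> %a, <4 x float> %b, <4 x float> %x, <4 x float> %y, i32 %n) {
pre:
  %mask_pre = fcmp olt <4 x float> %a, %b
  br label %head

head:
  ; the mask is carried across the back edge as <4 x i1>
  %i = phi i32 [ 0, %pre ], [ %i.next, %body ]
  %mask = phi <4 x i1> [ %mask_pre, %pre ], [ %mask_body, %body ]
  %cond = icmp slt i32 %i, %n
  br i1 %cond, label %body, label %out

body:
  ; the phi'd mask is only ever consumed by selects
  %sel = select <4 x i1> %mask, <4 x float> %x, <4 x float> %y
  %mask_body = fcmp olt <4 x float> %sel, %b
  %i.next = add nsw i32 %i, 1
  br label %head

out:
  %res = select <4 x i1> %mask, <4 x float> %x, <4 x float> %y
  ret <4 x float> %res
}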

lattner commented 12 years ago

This might be best-handled in CodeGenPrepare.

llvmbot commented 12 years ago

> Given a type T, the type it is legalized to depends only on T. This is fundamental to the design of the type legalizer. You can't make the decision depend on what kind of operation produced this type. I hope that's not what you are suggesting!

Not at all. I don't plan to change it.

> If I understand right, the <8 x i1> is being passed between basic blocks as <8 x i32>, while within the blocks <8 x i16> is expected.

Actually, it's the other way around: v8i1 is promoted to v8i16, while it should have been promoted to v8i32.

> So... why is it being passed between blocks as an <8 x i32>? Maybe the code that made that decision is simply old and not aware of how types are legalized nowadays.

We also have the same problem with this code:

define <4 x double> @foo(<4 x double> %x, <4 x double> %y) {
  %min_is_x = fcmp ult <4 x double> %x, %y
  %min = select <4 x i1> %min_is_x, <4 x double> %x, <4 x double> %y
  ret <4 x double> %min
}

v4i1 is legalized to v4i32 while it should have been legalized to v4i64.

llvmbot commented 12 years ago

Given a type T, the type it is legalized to depends only on T. This is fundamental to the design of the type legalizer. You can't make the decision depend on what kind of operation produced this type. I hope that's not what you are suggesting!

If I understand right, the <8 x i1> is being passed between basic blocks as <8 x i32>, while within the blocks <8 x i16> is expected. So... why is it being passed between blocks as an <8 x i32>? Maybe the code that made that decision is simply old and not aware of how types are legalized nowadays.

llvmbot commented 12 years ago

From what I can tell, this is a type-legalization problem. The type <8 x i1> is passed in a register (XMM!) between basic blocks. How should we legalize this type? Well... it depends. If this type is the result of a comparison of v8i16s, then it should be saved in a 128-bit register, but if the comparison is of v8i32s, then a 256-bit register should be used.

We can't solve this problem completely using the current design of the type legalizer because it only looks at one basic block at a time. However, I think that we can make things better by changing some of the type-legalizer policies.

Currently, vector elements are promoted until a wider legal type is found. In our case, v8i1 was promoted to v8i16, which fits into an XMM register. Alternatively, we could have kept promoting it to v8i32, which would have been placed into a YMM register.

I will play around and try to see if promoting to the largest vector size is a better strategy.
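
As a concrete companion to the <4 x double> example above, here is the same pattern at the width this report is about (an assumed example, not one of the attachments). According to the comments above, v8i1 was being promoted to v8i16 (XMM) at the time, whereas promoting it all the way to v8i32 would keep it in a YMM register alongside the compared values:

define <8 x i32> @min8(<8 x i32> %x, <8 x i32> %y) {
  %is_lt = icmp slt <8 x i32> %x, %y
  ; the <8 x i1> mask here ideally legalizes to the same element width as %x and %y
  %min = select <8 x i1> %is_lt, <8 x i32> %x, <8 x i32> %y
  ret <8 x i32> %min
}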