llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Scalars seem to always be chosen when vectorization cost is equal, even when already using vectors #39495

Open · llvmbot opened this issue 5 years ago

llvmbot commented 5 years ago
Bugzilla Link: 40148
Version: trunk
OS: All
Reporter: LLVM Bugzilla Contributor
CC: @alexey-bataev, @topperc, @hfinkel, @RKSimon, @rotateright

Extended Description

A cause of more i64x2 vector multiply bugs.

I think this is the higher scope of the bug, although I have a specific example.

Basically, what I see is that LLVM will always choose scalar, even when the cost model rates the vectorized form as equal. While this is beneficial for preventing unwanted vectorization, if you are already operating on vectors (most notably with vector extensions), LLVM will still force scalar code, resulting in extraction.

Here, I am trying a few ways to get a pmuludq, as well as implementing ARM's vmull_u32, all using vector extensions.

This is for trunk. 7.0 fails to vectorize pmuludq_v2, probably because of bug 40032.

```c
U64x2 pmuludq_v1(U64x2 top, U64x2 bot)
{
    return (top & 0xFFFFFFFF) * (bot & 0xFFFFFFFF);
}

U64x2 pmuludq_v2(U64x2 top, U64x2 bot)
{
    return (U64x2) {
        (top[0] & 0xFFFFFFFF) * (bot[0] & 0xFFFFFFFF),
        (top[1] & 0xFFFFFFFF) * (bot[1] & 0xFFFFFFFF)
    };
}

/* ARM-style */
U64x2 vmull_u32(U32x2 top, U32x2 bot)
{
    return (U64x2) {
        (U64)bot[0] * (U64)top[0],
        (U64)bot[1] * (U64)top[1]
    };
}
```
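For reference, a minimal sketch (an illustrative addition, not part of the original report) of the intrinsic spelling that pmuludq_v1 and pmuludq_v2 are expected to lower to; _mm_mul_epu32 is the SSE2 intrinsic for pmuludq:

```c
#include <emmintrin.h> /* SSE2 */

/* pmuludq reads only the low 32 bits of each 64-bit lane and produces a
   full 64-bit product, which is exactly what pmuludq_v1/v2 compute with
   their explicit & 0xFFFFFFFF masks. */
__m128i pmuludq_intrin(__m128i top, __m128i bot)
{
    return _mm_mul_epu32(top, bot);
}
```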

clang version 8.0.0 (trunk 350011)

clang -m32 -O3 -msse4.1:

```asm
pmuludq_v1: # @pmuludq_v1
        pmuludq  xmm0, xmm1
        ret

pmuludq_v2: # @pmuludq_v2
        pmuludq  xmm0, xmm1
        ret

vmull_u32: # @vmull_u32
        pmovzxdq xmm1, qword ptr [esp + 4]  # xmm1 = mem[0],zero,mem[1],zero
        pmovzxdq xmm0, qword ptr [esp + 12] # xmm0 = mem[0],zero,mem[1],zero
        pmuludq  xmm0, xmm1
        ret
```

clang -m64 -O3 -msse4.1:

```asm
pmuludq_v1: # @pmuludq_v1
        pmuludq    xmm0, xmm1
        ret

pmuludq_v2: # @pmuludq_v2
        movq       rax, xmm0
        mov        eax, eax
        movq       rcx, xmm1
        mov        ecx, ecx
        imul       rcx, rax
        pextrq     rax, xmm0, 1
        mov        eax, eax
        pextrq     rdx, xmm1, 1
        mov        edx, edx
        imul       rdx, rax
        movq       xmm1, rdx
        movq       xmm0, rcx
        punpcklqdq xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
        ret

vmull_u32: # @vmull_u32
        pextrd     eax, xmm0, 1
        movd       ecx, xmm0
        pextrd     edx, xmm1, 1
        movd       esi, xmm1
        imul       rcx, rsi
        imul       rax, rdx
        movq       xmm1, rax
        movq       xmm0, rcx
        punpcklqdq xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
        ret
```

Godbolt: https://godbolt.org/z/6tGRWn

In the first example, Clang generates the expected pmuludq instruction on both x86_64 and x86.

The second also generates the expected pmuludq on x86, but goes scalar on x86_64.

The third generates the expected pmovzxdq and pmuludq on x86, but again goes scalar on x86_64.
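As an aside, here is a sketch of the intrinsic spelling of the vmull_u32 pattern on x86 (my illustration with a hypothetical function name, not from the report; it assumes the two u32 lanes sit in the low 8 bytes of an __m128i), matching the pmovzxdq + pmuludq sequence the 32-bit build already emits:

```c
#include <smmintrin.h> /* SSE4.1, for _mm_cvtepu32_epi64 */

/* Zero-extend the two low u32 lanes to u64 lanes (pmovzxdq), then let
   pmuludq multiply the low 32 bits of each lane into 64-bit products. */
__m128i vmull_u32_intrin(__m128i top, __m128i bot)
{
    return _mm_mul_epu32(_mm_cvtepu32_epi64(top),
                         _mm_cvtepu32_epi64(bot));
}
```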

According to -Rpass-missed=".*", each of the scalar x86_64 examples (unlike x86) reports:

"List vectorization was possible but not beneficial with cost 0 >= 0."

Since LLVM sees that there is no difference between a pmuludq and scalar, it chooses scalar.

If there is no difference in cost, LLVM should stay in whichever form the code already uses, vector or scalar, or at least consider the cost of extraction if it isn't doing so already.

Even if they are the same speed after extraction, the vectorized version should still be preferred: in the example above, the vectorized pmuludq_v1 takes up 5 bytes of code, while the scalar pmuludq_v2 takes 55 bytes.
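To make the requested behavior concrete, a hypothetical sketch of the tie-break (illustration only; this is not LLVM's actual cost-model code, and all names here are made up):

```c
/* On a strict cost win, take the cheaper form; on a tie, keep whichever
   form the operands are already in, so values sitting in xmm registers
   are not extracted (pextrq/movq) just to be reassembled afterwards. */
int should_vectorize(int vec_cost, int scalar_cost, int operands_are_vectors)
{
    if (vec_cost != scalar_cost)
        return vec_cost < scalar_cost;
    return operands_are_vectors;
}
```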

llvmbot commented 5 years ago

Very interesting output: note how Clang decides that SSE2 should mask and SSE4.1 should not.

Also, GCC is generating complete and utter nonsense.

https://godbolt.org/z/H_tOi1

llvmbot commented 5 years ago

Hmm. With -msse2 instead of -msse4.1 on 32-bit, we get the same output for pmuludq_v1, but Clang generates this for pmuludq_v2 (code affected by bug 40142):

```asm
.LCPI1_0:
        .long      4294967295 # 0xffffffff
        .long      0          # 0x0
        .long      4294967295 # 0xffffffff
        .long      0          # 0x0

pmuludq_v2: # @pmuludq_v2
        movdqa     xmm3, xmmword ptr [.LCPI1_0] # xmm3 = [4294967295,0,4294967295,0]
        movdqa     xmm2, xmm1
        punpckhqdq xmm1, xmm0 # xmm1 = xmm1[1],xmm0[1]
        punpcklqdq xmm2, xmm0 # xmm2 = xmm2[0],xmm0[0]
        pand       xmm2, xmm3
        pand       xmm1, xmm3
        movdqa     xmm0, xmm2
        punpckhqdq xmm2, xmm1 # xmm2 = xmm2[1],xmm1[1]
        punpcklqdq xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
        pmuludq    xmm2, xmm0
        movdqa     xmm0, xmm2
        ret
```

There are no cost optimization notes.

llvmbot commented 5 years ago

Note: If it wasn't clear, I was using these typedefs:

```c
typedef unsigned long long U64;
typedef unsigned U32;
typedef U64 U64x2 __attribute__((vector_size(16))); // basically __m128i
typedef U32 U32x2 __attribute__((vector_size(8)));  // basically __m64
```
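A small usage note (my addition; the helper names are hypothetical): clang and GCC allow casting directly between same-size vector-extension types and the intrinsic types, since __m128i is itself defined via vector_size(16), which makes it easy to mix the typedefs above with intrinsics when comparing codegen:

```c
#include <emmintrin.h>

typedef unsigned long long U64;
typedef U64 U64x2 __attribute__((vector_size(16)));

/* Free reinterpretations, not value conversions: both types are
   16-byte vectors, so these casts compile away to nothing. */
static inline __m128i u64x2_to_m128i(U64x2 v) { return (__m128i)v; }
static inline U64x2   m128i_to_u64x2(__m128i v) { return (U64x2)v; }
```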