chandlerc opened this issue 6 years ago (status: open)
This is Rust issue 53232: https://github.com/rust-lang/rust/issues/53232
Is there anything we can do in the Rust frontend to work around this?
If an _mm_cmpestrc() intrinsic is used to check whether a match was found, and then an _mm_cmpestrm() intrinsic is used to obtain the mask __m128i, the assembly that is generated by rustc contains a (V)PCMPESTRI instruction and a (V)PCMPESTRM instruction. Ideally, the assembly would have only one (V)PCMPESTRM instruction.
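The pattern described above can be sketched as follows. This is a minimal, hypothetical helper assuming SSE4.2: the function name, the copy-into-padded-buffers approach, and the EQUAL_ANY/BIT_MASK mode are illustrative and not taken from the report.

```cpp
#include <immintrin.h>
#include <cstring>

// Hypothetical helper: find which bytes of `hay` match any byte of `needles`.
// Lengths must be <= 16. Requires SSE4.2 (hence the target attribute).
__attribute__((target("sse4.2")))
int match_mask(const char *needles, int nlen, const char *hay, int hlen) {
  alignas(16) char a16[16] = {0}, b16[16] = {0};
  std::memcpy(a16, needles, nlen);
  std::memcpy(b16, hay, hlen);
  const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(a16));
  const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(b16));
  constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK;
  // Same operands, same imm8: ideally the backend would emit one (V)PCMPESTRM
  // and branch on the CF it sets, instead of a PCMPESTRI/PCMPESTRM pair.
  if (_mm_cmpestrc(a, nlen, b, hlen, mode))
    return _mm_cvtsi128_si32(_mm_cmpestrm(a, nlen, b, hlen, mode));
  return 0;
}
```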
GCC 8.2.0 was able to coalesce the PCMPxSTRx instructions; it turned the _mm_cmpestrc()/_mm_cmpestrm() intrinsic calls into:
pcmpestrm $0, %xmm1, %xmm2
jnc L27
... and the _mm_cmpistrc()/_mm_cmpistrm() intrinsic calls into:
pcmpistrm $0, 16(%rsp), %xmm4
jnc L29
while LLVM trunk does not (https://godbolt.org/g/nMVyqQ):
movdqa 144(%rsp), %xmm0
movdqa 128(%rsp), %xmm1
pcmpistri $0, %xmm1, %xmm0
jae .LBB1_5
pcmpistrm $0, %xmm1, %xmm0
movd %xmm0, %eax
For more info see the Rust issue and also the issue that triggered the report: https://github.com/shepmaster/jetscii/pull/24#issuecomment-411861393
Found another issue.
Right now Clang always uses pcmpistri for the flag intrinsics. GCC and ICC seem to be able to switch to using pcmpistrm if the mask result is being used by another intrinsic, like some kind of STTNI-aware CSE.
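The implicit-length shape of that pattern, a flag intrinsic guarding a mask intrinsic with identical operands and imm8, can be sketched as below. This is a hedged illustration, not Chandler's function: the helper name, the strncpy padding, and the BIT_MASK mode are assumptions.

```cpp
#include <immintrin.h>
#include <cstring>

// Returns the bit mask of mismatching byte positions between two strings
// (compared as 16-byte, zero-padded chunks), or 0 when they compare equal.
// GCC/ICC coalesce this _mm_cmpistrc/_mm_cmpistrm pair into one PCMPISTRM.
__attribute__((target("sse4.2")))
unsigned diff_mask(const char *s1, const char *s2) {
  alignas(16) char a16[16] = {0}, b16[16] = {0};
  std::strncpy(a16, s1, 16);
  std::strncpy(b16, s2, 16);
  const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(a16));
  const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(b16));
  constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                       _SIDD_NEGATIVE_POLARITY | _SIDD_BIT_MASK;
  if (_mm_cmpistrc(a, b, mode))                          // CF: any mismatch?
    return _mm_cvtsi128_si32(_mm_cmpistrm(a, b, mode));  // reuse same compare
  return 0;
}
```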
I hacked Chandler's function a little to demonstrate this.
The missed load folding was easily fixed.
Folding the comparison is quite a bit harder. I don't think I'll have time to finish it sadly.
Here is a sketch of how I think you could do this in case anyone wants to pick it up:
There are probably some cases you can't handle and will need to bail out on...
HTH whoever picks this up.
And an updated comparison: https://godbolt.org/g/sx5aDP
Since I wasn't clear, I'd much prefer that users write the first code sequence.
Duplicating the calls to the intrinsic and relying on them being folded seems quite bad. I'd rather we successfully fold tests against ecx into equivalent flags.
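The preferred single-call sequence, one _mm_cmpistri whose returned index is then tested, might look like the sketch below. PCMPISTRI already sets CF exactly when the index is valid (!= 16), so the testl on %ecx could be folded into a jae/jb. The helper name and strncpy padding are illustrative, not from the report.

```cpp
#include <immintrin.h>
#include <cstring>

// Returns the index of the first mismatching byte between two strings
// (compared as 16-byte, zero-padded chunks), or -1 if they compare equal.
__attribute__((target("sse4.2")))
int first_mismatch(const char *s1, const char *s2) {
  alignas(16) char a16[16] = {0}, b16[16] = {0};
  std::strncpy(a16, s1, 16);
  std::strncpy(b16, s2, 16);
  const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(a16));
  const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(b16));
  constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                       _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT;
  int idx = _mm_cmpistri(a, b, mode);  // index of first mismatch, 16 if none
  return idx == 16 ? -1 : idx;         // idx != 16 <=> CF set by pcmpistri
}
```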
Extended Description
For hypothetical memcmp code:
#include <immintrin.h>
#include <cstddef>
extern "C" int my_memcmp(const void *lhs, const void *rhs, size_t count) {
  const unsigned char *s1 = reinterpret_cast<const unsigned char *>(lhs);
  const unsigned char *s2 = reinterpret_cast<const unsigned char *>(rhs);

  const __m128i s1chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
  const __m128i s2chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));

  constexpr unsigned char mode =
      _SIDD_UBYTE_OPS |           // compare unsigned 8-bit characters
      _SIDD_CMP_EQUAL_EACH |      // for equality
      _SIDD_NEGATIVE_POLARITY |   // negate results
      _SIDD_LEAST_SIGNIFICANT;

  unsigned mask = _mm_cmpistri(s1chunk, s2chunk, mode);
  if (mask != 0) {
    return s1[mask] - s2[mask];
  }

  return 0;
}
LLVM generates annoyingly suboptimal code:
my_memcmp:                              # @my_memcmp
        movdqu  (%rdi), %xmm0
        movdqu  (%rsi), %xmm1
        pcmpistri       $24, %xmm1, %xmm0
        testl   %ecx, %ecx
        je      .LBB0_1
        movl    %ecx, %ecx
        movzbl  (%rdi,%rcx), %eax
        movzbl  (%rsi,%rcx), %ecx
        subl    %ecx, %eax
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
For comparison: https://godbolt.org/g/LiHXxk
Changing the code to use the flag-setting spelling doesn't really help:
if (_mm_cmpistrc(s1chunk, s2chunk, mode)) {
  unsigned mask = _mm_cmpistri(s1chunk, s2chunk, mode);
  return s1[mask] - s2[mask];
}
LLVM now generates:
my_memcmp:                              # @my_memcmp
        movdqu  (%rdi), %xmm0
        movdqu  (%rsi), %xmm1
        pcmpistri       $24, %xmm1, %xmm0
        jae     .LBB0_1
        pcmpistri       $24, %xmm1, %xmm0
        movzbl  (%rdi,%rcx), %eax
        movzbl  (%rsi,%rcx), %ecx
        subl    %ecx, %eax
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
And for comparison: https://godbolt.org/g/UtYqu2
What we really should be generating:
my_memcmp:                              # @my_memcmp
        movdqu  (%rdi), %xmm0
        pcmpistri       $24, (%rsi), %xmm0
        jae     .LBB0_1
        movzbl  (%rdi,%rcx), %eax
        movzbl  (%rsi,%rcx), %ecx
        subl    %ecx, %eax
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
ICC generates a weird self-move for %ecx? Surely that's not needed.