
Poor lowering of SSE4.2 string intrinsics #36594

Open · chandlerc opened 6 years ago

chandlerc commented 6 years ago
Bugzilla Link: 37246
Version: trunk
OS: Linux
CC: @topperc, @gnzlbg, @RKSimon, @rotateright

Extended Description

For hypothetical memcmp code:

    #include <immintrin.h>
    #include <cstddef>

    extern "C" int my_memcmp(const void *lhs, const void *rhs, size_t count) {
      const unsigned char *s1 = reinterpret_cast<const unsigned char *>(lhs);
      const unsigned char *s2 = reinterpret_cast<const unsigned char *>(rhs);

      const __m128i s1chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
      const __m128i s2chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));

      constexpr unsigned char mode =
          _SIDD_UBYTE_OPS |          // compare unsigned 8-bit characters
          _SIDD_CMP_EQUAL_EACH |     // for equality
          _SIDD_NEGATIVE_POLARITY |  // negate results
          _SIDD_LEAST_SIGNIFICANT;

      unsigned mask = _mm_cmpistri(s1chunk, s2chunk, mode);
      if (mask != 0) {
        return s1[mask] - s2[mask];
      }

      return 0;
    }
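A throwaway harness like the following is enough to exercise this (hypothetical, not taken from the report; it assumes my_memcmp above is compiled in the same file, and the buffer contents are arbitrary):

    #include <cstdio>
    #include <cstring>

    int main() {
      unsigned char a[16], b[16];
      std::memset(a, 'x', sizeof a);
      std::memset(b, 'x', sizeof b);
      b[5] = 'y';                                 // one differing byte at index 5
      std::printf("%d\n", my_memcmp(a, b, 16));   // expect 'x' - 'y' == -1
      return 0;
    }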

LLVM generates annoyingly suboptimal code:

    my_memcmp:                              # @my_memcmp
            movdqu  (%rdi), %xmm0
            movdqu  (%rsi), %xmm1
            pcmpistri       $24, %xmm1, %xmm0
            testl   %ecx, %ecx
            je      .LBB0_1
            movl    %ecx, %ecx
            movzbl  (%rdi,%rcx), %eax
            movzbl  (%rsi,%rcx), %ecx
            subl    %ecx, %eax
            retq
    .LBB0_1:
            xorl    %eax, %eax
            retq
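(For reference, the $24 immediate here is just the mode constant spelled out: _SIDD_UBYTE_OPS (0x00) | _SIDD_CMP_EQUAL_EACH (0x08) | _SIDD_NEGATIVE_POLARITY (0x10) | _SIDD_LEAST_SIGNIFICANT (0x00) = 0x18 = 24.)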

For comparison: https://godbolt.org/g/LiHXxk

Changing the code to use the flag-setting spelling doesn't really help:

    if (_mm_cmpistrc(s1chunk, s2chunk, mode)) {
      unsigned mask = _mm_cmpistri(s1chunk, s2chunk, mode);
      return s1[mask] - s2[mask];
    }

LLVM now generates:

    my_memcmp:                              # @my_memcmp
            movdqu  (%rdi), %xmm0
            movdqu  (%rsi), %xmm1
            pcmpistri       $24, %xmm1, %xmm0
            jae     .LBB0_1
            pcmpistri       $24, %xmm1, %xmm0
            movzbl  (%rdi,%rcx), %eax
            movzbl  (%rsi,%rcx), %ecx
            subl    %ecx, %eax
            retq
    .LBB0_1:
            xorl    %eax, %eax
            retq

And for comparison: https://godbolt.org/g/UtYqu2

What we really should be generating:

    my_memcmp:                              # @my_memcmp
            movdqu  (%rdi), %xmm0
            pcmpistri       $24, (%rsi), %xmm0
            jae     .LBB0_1
            movzbl  (%rdi,%rcx), %eax
            movzbl  (%rsi,%rcx), %ecx
            subl    %ecx, %eax
            retq
    .LBB0_1:
            xorl    %eax, %eax
            retq

ICC generates a weird self-move for %ecx? Surely that's not needed.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

This is Rust issue 53232: https://github.com/rust-lang/rust/issues/53232

Is there anything we can do in the Rust frontend to work around this?


If an _mm_cmpestrc() intrinsic is used to check whether a match was found, and then an _mm_cmpestrm() intrinsic is used to obtain the mask __m128i, the assembly that is generated by rustc contains a (V)PCMPESTRI instruction and a (V)PCMPESTRM instruction. Ideally, the assembly would have only one (V)PCMPESTRM instruction.
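In C terms the reported pattern is roughly the sketch below (the function name, mode, and the way the mask is consumed are invented here; the actual code is in the linked Rust issue). Both intrinsic calls describe the same PCMPESTRx computation with identical operands, and PCMPESTRM already sets CF while producing the mask, so ideally a single instruction would suffice:

    #include <immintrin.h>

    // Hypothetical illustration: check the carry flag with _mm_cmpestrc, then
    // fetch the match mask with _mm_cmpestrm using the same operands and mode.
    unsigned find_any(__m128i needles, int needle_len,
                      __m128i haystack, int haystack_len) {
      constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK;
      if (_mm_cmpestrc(needles, needle_len, haystack, haystack_len, mode)) {
        __m128i mask = _mm_cmpestrm(needles, needle_len, haystack, haystack_len, mode);
        return static_cast<unsigned>(_mm_cvtsi128_si32(mask)); // 16-bit match mask
      }
      return 0;
    }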

GCC 8.2.0 was able to coalesce the PCMPxSTRx instructions; it turned the _mm_cmpestrc()/_mm_cmpestrm() intrinsic calls into:

pcmpestrm   $0, %xmm1, %xmm2
jnc L27

...and the _mm_cmpistrc()/_mm_cmpistrm() intrinsic calls into:

pcmpistrm   $0, 16(%rsp), %xmm4
jnc L29

while LLVM trunk does not (https://godbolt.org/g/nMVyqQ):

    movdqa  144(%rsp), %xmm0
    movdqa  128(%rsp), %xmm1
    pcmpistri       $0, %xmm1, %xmm0
    jae     .LBB1_5
    pcmpistrm       $0, %xmm1, %xmm0
    movd    %xmm0, %eax

For more info see the Rust issue and also the issue that triggered the report: https://github.com/shepmaster/jetscii/pull/24#issuecomment-411861393

topperc commented 6 years ago

Found another issue.

Right now clang always uses pcmpistri for the flag intrinsics. GCC and ICC seem to be able to switch to using pcmpistrm if the mask result is being used by another intrinsic, like some kind of STTNI-aware CSE.

I hacked Chandler's function a little to demonstrate this.

https://godbolt.org/g/CUUUBH
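The hacked-up function itself is in the godbolt link; its rough shape is something like the sketch below (the name and the final popcount are just for illustration). With identical operands and mode, the _mm_cmpistrc flag check and the _mm_cmpistrm mask could share a single pcmpistrm, which is what GCC/ICC emit, while clang emits a pcmpistri for the flag check plus a separate pcmpistrm for the mask:

    #include <immintrin.h>

    // Illustrative variant where the mask result feeds another computation
    // instead of indexing back into the buffers.
    extern "C" int my_memcmp_mask(const void *lhs, const void *rhs) {
      const __m128i s1chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
      const __m128i s2chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
      constexpr unsigned char mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                                     _SIDD_NEGATIVE_POLARITY | _SIDD_BIT_MASK;
      if (_mm_cmpistrc(s1chunk, s2chunk, mode)) {
        __m128i mask = _mm_cmpistrm(s1chunk, s2chunk, mode);
        return __builtin_popcount(_mm_cvtsi128_si32(mask)); // number of differing bytes
      }
      return 0;
    }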

chandlerc commented 6 years ago

The missed load folding was easily fixed.

Folding the comparison is quite a bit harder. I don't think I'll have time to finish it sadly.

Here is a sketch of how I think you could do this in case anyone wants to pick it up:

Probably some cases you can't handle and will need to bail out on....

HTH whoever picks this up.

chandlerc commented 6 years ago

Craig pointed out that my first code snippet was wrong. It should be testing against 16 instead (since we're comparing 8-bit characters, there are 16 of them per vector):

    unsigned index = _mm_cmpistri(s1chunk, s2chunk, mode);
    if (index != 16) {
      return s1[index] - s2[index];
    }

LLVM still generates suboptimal code:

    my_memcmp:                              # @my_memcmp
            movdqu  (%rdi), %xmm0
            movdqu  (%rsi), %xmm1
            pcmpistri       $24, %xmm1, %xmm0
            xorl    %eax, %eax
            cmpl    $16, %ecx
            je      .LBB0_2
            movl    %ecx, %ecx
            movzbl  (%rdi,%rcx), %eax
            movzbl  (%rsi,%rcx), %ecx
            subl    %ecx, %eax
    .LBB0_2:
            retq

And updated comparison: https://godbolt.org/g/sx5aDP

chandlerc commented 6 years ago

Since I wasn't clear, I'd much prefer that users write the first code sequence.

Duplicating the calls to the intrinsic and relying on them being folded seems quite bad. I'd rather we successfully fold tests against %ecx into the equivalent flag tests.