chandlerc opened this issue 6 years ago (status: open)
This is Rust issue 53232: https://github.com/rust-lang/rust/issues/53232
Is there anything we can do in the Rust frontend to work around this?
If an _mm_cmpestrc() intrinsic is used to check whether a match was found, and then an _mm_cmpestrm() intrinsic is used to obtain the mask __m128i, the assembly that is generated by rustc contains a (V)PCMPESTRI instruction and a (V)PCMPESTRM instruction. Ideally, the assembly would have only one (V)PCMPESTRM instruction.
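The pattern described above can be sketched as follows. This is a minimal, hypothetical helper assuming SSE4.2: the function name, the copy-into-padded-buffers approach, and the EQUAL_ANY/BIT_MASK mode are illustrative and not taken from the report.

```cpp
#include <immintrin.h>
#include <cstring>

// Hypothetical helper: find which bytes of `hay` match any byte of `needles`.
// Lengths must be <= 16. Requires SSE4.2 (hence the target attribute).
__attribute__((target("sse4.2")))
int match_mask(const char *needles, int nlen, const char *hay, int hlen) {
  alignas(16) char a16[16] = {0}, b16[16] = {0};
  std::memcpy(a16, needles, nlen);
  std::memcpy(b16, hay, hlen);
  const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(a16));
  const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(b16));
  constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK;
  // Same operands, same imm8: ideally the backend would emit one (V)PCMPESTRM
  // and branch on the CF it sets, instead of a PCMPESTRI/PCMPESTRM pair.
  if (_mm_cmpestrc(a, nlen, b, hlen, mode))
    return _mm_cvtsi128_si32(_mm_cmpestrm(a, nlen, b, hlen, mode));
  return 0;
}
```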
GCC 8.2.0 was able to coalesce the PCMPxSTRx instructions; it turned the _mm_cmpestrc()/_mm_cmpestrm() intrinsic calls into:
pcmpestrm $0, %xmm1, %xmm2
jnc L27
... and the _mm_cmpistrc()/_mm_cmpistrm() intrinsic calls into:
pcmpistrm $0, 16(%rsp), %xmm4
jnc L29
while LLVM trunk does not (https://godbolt.org/g/nMVyqQ):
movdqa 144(%rsp), %xmm0
movdqa 128(%rsp), %xmm1
pcmpistri $0, %xmm1, %xmm0
jae .LBB1_5
pcmpistrm $0, %xmm1, %xmm0
movd %xmm0, %eax
For more info see the Rust issue and also the issue that triggered the report: https://github.com/shepmaster/jetscii/pull/24#issuecomment-411861393
Found another issue.
Right now Clang always uses pcmpistri for the flag intrinsics. GCC and ICC seem to be able to switch to using pcmpistrm if the mask result is being used by another intrinsic, like some kind of STTNI-aware CSE.
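The implicit-length shape of that pattern, a flag intrinsic guarding a mask intrinsic with identical operands and imm8, can be sketched as below. This is a hedged illustration, not Chandler's function: the helper name, the strncpy padding, and the BIT_MASK mode are assumptions.

```cpp
#include <immintrin.h>
#include <cstring>

// Returns the bit mask of mismatching byte positions between two strings
// (compared as 16-byte, zero-padded chunks), or 0 when they compare equal.
// GCC/ICC coalesce this _mm_cmpistrc/_mm_cmpistrm pair into one PCMPISTRM.
__attribute__((target("sse4.2")))
unsigned diff_mask(const char *s1, const char *s2) {
  alignas(16) char a16[16] = {0}, b16[16] = {0};
  std::strncpy(a16, s1, 16);
  std::strncpy(b16, s2, 16);
  const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(a16));
  const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(b16));
  constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                       _SIDD_NEGATIVE_POLARITY | _SIDD_BIT_MASK;
  if (_mm_cmpistrc(a, b, mode))                          // CF: any mismatch?
    return _mm_cvtsi128_si32(_mm_cmpistrm(a, b, mode));  // reuse same compare
  return 0;
}
```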
I hacked Chandler's function a little to demonstrate this.
The missed load folding was easily fixed.
Folding the comparison is quite a bit harder. I don't think I'll have time to finish it sadly.
Here is a sketch of how I think you could do this in case anyone wants to pick it up:
There are probably some cases you can't handle and will need to bail out on...
HTH whoever picks this up.
And an updated comparison: https://godbolt.org/g/sx5aDP
Since I wasn't clear, I'd much prefer that users write the first code sequence.
Duplicating the calls to the intrinsic and relying on them being folded seems quite bad. I'd rather we successfully fold tests against ecx into equivalent flags.
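The preferred single-call sequence, one _mm_cmpistri whose returned index is then tested, might look like the sketch below. PCMPISTRI already sets CF exactly when the index is valid (!= 16), so the testl on %ecx could be folded into a jae/jb. The helper name and strncpy padding are illustrative, not from the report.

```cpp
#include <immintrin.h>
#include <cstring>

// Returns the index of the first mismatching byte between two strings
// (compared as 16-byte, zero-padded chunks), or -1 if they compare equal.
__attribute__((target("sse4.2")))
int first_mismatch(const char *s1, const char *s2) {
  alignas(16) char a16[16] = {0}, b16[16] = {0};
  std::strncpy(a16, s1, 16);
  std::strncpy(b16, s2, 16);
  const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i *>(a16));
  const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i *>(b16));
  constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                       _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT;
  int idx = _mm_cmpistri(a, b, mode);  // index of first mismatch, 16 if none
  return idx == 16 ? -1 : idx;         // idx != 16 <=> CF set by pcmpistri
}
```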
Extended Description
For hypothetical memcmp code:
#include <immintrin.h>
#include <cstddef>
extern "C" int my_memcmp(const void *lhs, const void *rhs, size_t count) {
  const unsigned char *s1 = reinterpret_cast<const unsigned char *>(lhs);
  const unsigned char *s2 = reinterpret_cast<const unsigned char *>(rhs);

  const __m128i s1chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
  const __m128i s2chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));

  constexpr unsigned char mode =
      _SIDD_UBYTE_OPS |           // compare unsigned 8-bit characters
      _SIDD_CMP_EQUAL_EACH |      // for equality
      _SIDD_NEGATIVE_POLARITY |   // negate results
      _SIDD_LEAST_SIGNIFICANT;

  unsigned mask = _mm_cmpistri(s1chunk, s2chunk, mode);
  if (mask != 0) {
    return s1[mask] - s2[mask];
  }

  return 0;
}
LLVM generates annoyingly suboptimal code:
my_memcmp:                              # @my_memcmp
        movdqu  (%rdi), %xmm0
        movdqu  (%rsi), %xmm1
        pcmpistri       $24, %xmm1, %xmm0
        testl   %ecx, %ecx
        je      .LBB0_1
        movl    %ecx, %ecx
        movzbl  (%rdi,%rcx), %eax
        movzbl  (%rsi,%rcx), %ecx
        subl    %ecx, %eax
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
For comparison: https://godbolt.org/g/LiHXxk
Changing the code to use the flag-setting spelling doesn't really help:
if (_mm_cmpistrc(s1chunk, s2chunk, mode)) {
  unsigned mask = _mm_cmpistri(s1chunk, s2chunk, mode);
  return s1[mask] - s2[mask];
}
LLVM now generates:
my_memcmp:                              # @my_memcmp
        movdqu  (%rdi), %xmm0
        movdqu  (%rsi), %xmm1
        pcmpistri       $24, %xmm1, %xmm0
        jae     .LBB0_1
        pcmpistri       $24, %xmm1, %xmm0
        movzbl  (%rdi,%rcx), %eax
        movzbl  (%rsi,%rcx), %ecx
        subl    %ecx, %eax
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
And for comparison: https://godbolt.org/g/UtYqu2
What we really should be generating:
my_memcmp:                              # @my_memcmp
        movdqu  (%rdi), %xmm0
        pcmpistri       $24, (%rsi), %xmm0
        jae     .LBB0_1
        movzbl  (%rdi,%rcx), %eax
        movzbl  (%rsi,%rcx), %ecx
        subl    %ecx, %eax
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
ICC generates a weird self-move for %ecx? Surely that's not needed.