Open · abadams opened this issue 1 month ago
Author: Andrew Adams (abadams)

Four-way interleaves with AVX2 aren't generating good code. Consider the following:
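As a rough illustration of the kind of code in question (a hypothetical sketch, not necessarily the exact interleave2/interleave4 functions from this report), a four-way interleave is naturally written as nested two-way interleaves:

```c++
#include <cstring>

typedef float floatx8_vec  __attribute__((ext_vector_type(8)));
typedef float floatx16_vec __attribute__((ext_vector_type(16)));

// Hypothetical sketch, not the exact snippet from the report.
// Interleave two 8-wide vectors into a 16-wide vector: a0 b0 a1 b1 ...
auto interleave2(floatx8_vec a, floatx8_vec b) {
    return __builtin_shufflevector(a, b, 0, 8, 1, 9, 2, 10, 3, 11,
                                   4, 12, 5, 13, 6, 14, 7, 15);
}

// Interleave two 16-wide vectors into a 32-wide vector.
auto interleave2(floatx16_vec a, floatx16_vec b) {
    return __builtin_shufflevector(a, b, 0, 16, 1, 17, 2, 18, 3, 19,
                                   4, 20, 5, 21, 6, 22, 7, 23,
                                   8, 24, 9, 25, 10, 26, 11, 27,
                                   12, 28, 13, 29, 14, 30, 15, 31);
}

// out = a0 b0 c0 d0 a1 b1 c1 d1 ... built from nested two-way interleaves.
void interleave4(float *a, float *b, float *c, float *d, float *out) {
    floatx8_vec va, vb, vc, vd;
    std::memcpy(&va, a, sizeof(va));
    std::memcpy(&vb, b, sizeof(vb));
    std::memcpy(&vc, c, sizeof(vc));
    std::memcpy(&vd, d, sizeof(vd));
    auto ac = interleave2(va, vc);    // a0 c0 a1 c1 ...
    auto bd = interleave2(vb, vd);    // b0 d0 b1 d1 ...
    auto abcd = interleave2(ac, bd);  // a0 b0 c0 d0 a1 b1 c1 d1 ...
    std::memcpy(out, &abcd, sizeof(abcd));
}
```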
With -O3 -mavx2 it generates good code for the interleave2 functions, but makes a mess of interleave4. It should generate 16 unpckh/unpckl/vperm2 instructions (18 cycles on Skylake, according to llvm-mca), but instead it generates 44 vshuf/vperm/blend instructions (32 cycles). See below (copy-pasted from godbolt):

If you force materialization of the intermediate values, you get the expected code (minus the spill/reload used to force materialization):

If this is hard to fix, I'd love a workaround that doesn't involve memory operations or inline assembly. Tagging @RKSimon, since according to git history he has been working on x86 shuffle lowering most recently.

Actually, there's a 12-instruction version. You don't need to bother with the first four vperm2f128s: because the unpack instructions operate within 128-bit lanes, you can account for them in the final four vperm2f128s instead. When you write this directly, LLVM does the right thing, so I have my workaround. But ideally, other ways of writing a four-way interleave should generate something similar:
```c++
#include <cstring>

typedef float floatx8_vec __attribute__((ext_vector_type(8)));

// Lane-local unpack of the low halves: vunpcklps on ymm registers.
auto unpckl(floatx8_vec a, floatx8_vec b) {
    return __builtin_shufflevector(a, b, 0, 8, 1, 9, 4, 12, 5, 13);
}

// Lane-local unpack of the high halves: vunpckhps on ymm registers.
auto unpckh(floatx8_vec a, floatx8_vec b) {
    return __builtin_shufflevector(a, b, 2, 10, 3, 11, 6, 14, 7, 15);
}

void interleave4_v3(float *a, float *b, float *c, float *d, float *out) {
    floatx8_vec va, vb, vc, vd;
    std::memcpy(&va, a, sizeof(va));
    std::memcpy(&vb, b, sizeof(vb));
    std::memcpy(&vc, c, sizeof(vc));
    std::memcpy(&vd, d, sizeof(vd));

    // Two rounds of unpacks interleave a/b/c/d within each 128-bit lane.
    auto ac_lo = unpckl(va, vc);
    auto ac_hi = unpckh(va, vc);
    auto bd_lo = unpckl(vb, vd);
    auto bd_hi = unpckh(vb, vd);
    auto abcd_lo_lo = unpckl(ac_lo, bd_lo);
    auto abcd_lo_hi = unpckh(ac_lo, bd_lo);
    auto abcd_hi_lo = unpckl(ac_hi, bd_hi);
    auto abcd_hi_hi = unpckh(ac_hi, bd_hi);

    // Four cross-lane shuffles put the 128-bit lanes into output order.
    auto out0 = __builtin_shufflevector(abcd_lo_lo, abcd_lo_hi, 0, 1, 2, 3, 8, 9, 10, 11);
    auto out1 = __builtin_shufflevector(abcd_hi_lo, abcd_hi_hi, 0, 1, 2, 3, 8, 9, 10, 11);
    auto out2 = __builtin_shufflevector(abcd_lo_lo, abcd_lo_hi, 4, 5, 6, 7, 12, 13, 14, 15);
    auto out3 = __builtin_shufflevector(abcd_hi_lo, abcd_hi_hi, 4, 5, 6, 7, 12, 13, 14, 15);

    std::memcpy(out, &out0, sizeof(out0));
    std::memcpy(out + 8, &out1, sizeof(out1));
    std::memcpy(out + 16, &out2, sizeof(out2));
    std::memcpy(out + 24, &out3, sizeof(out3));
}
```
This compiles to the expected 12 shuffle instructions:

```asm
interleave4_v3(float*, float*, float*, float*, float*):
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rsi]
        vmovups ymm2, ymmword ptr [rdx]
        vmovups ymm3, ymmword ptr [rcx]
        vunpcklps ymm4, ymm0, ymm2
        vunpckhps ymm0, ymm0, ymm2
        vunpcklps ymm2, ymm1, ymm3
        vunpckhps ymm1, ymm1, ymm3
        vunpcklps ymm3, ymm4, ymm2
        vunpckhps ymm2, ymm4, ymm2
        vunpcklps ymm4, ymm0, ymm1
        vunpckhps ymm0, ymm0, ymm1
        vinsertf128 ymm1, ymm3, xmm2, 1
        vinsertf128 ymm5, ymm4, xmm0, 1
        vperm2f128 ymm2, ymm3, ymm2, 49
        vperm2f128 ymm0, ymm4, ymm0, 49
        vmovups ymmword ptr [r8], ymm1
        vmovups ymmword ptr [r8 + 32], ymm5
        vmovups ymmword ptr [r8 + 64], ymm2
        vmovups ymmword ptr [r8 + 96], ymm0
        vzeroupper
        ret
```
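For reference, a small hypothetical harness (not part of the original report) built alongside the snippet above can confirm the a0 b0 c0 d0 ... output layout that interleave4_v3 is meant to produce:

```c++
#include <cassert>
#include <cstdio>

void interleave4_v3(float *a, float *b, float *c, float *d, float *out);

int main() {
    float a[8], b[8], c[8], d[8], out[32];
    for (int i = 0; i < 8; i++) {
        a[i] = i;
        b[i] = 10 + i;
        c[i] = 20 + i;
        d[i] = 30 + i;
    }
    interleave4_v3(a, b, c, d, out);
    // The expected layout is a0 b0 c0 d0 a1 b1 c1 d1 ...
    for (int i = 0; i < 8; i++) {
        assert(out[4 * i + 0] == a[i]);
        assert(out[4 * i + 1] == b[i]);
        assert(out[4 * i + 2] == c[i]);
        assert(out[4 * i + 3] == d[i]);
    }
    std::puts("interleave order ok");
}
```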