LebedevRI opened 5 years ago
This should be very good for x86 perf (replacing 2 ymm memops + a ymm shuffle with 2 xmm memops). Probably helps other targets too.
If we solve bug 41429, we have a vector select in the form of a shuffle in IR. Close enough to scalar select-store patterns to consider as a generic (DSE? earlyCSE?) optimization (see bug 39603) since it always eliminates ops?
If that's too far of a stretch, then we either have to enhance the SDAG load/store splitting/combining or make a custom pass to do this.
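For reference, the shuffle-as-select form mentioned above would look roughly like this in IR (a sketch in modern opaque-pointer syntax; the exact types and value names are my assumptions, not taken from the testcase):

```llvm
define void @example(ptr %dest, ptr %a) {
  %d = load <4 x i64>, ptr %dest
  %s = load <4 x i64>, ptr %a
  ; select low half from %d, high half from %s
  %blend = shufflevector <4 x i64> %d, <4 x i64> %s,
                         <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  store <4 x i64> %blend, ptr %dest
  ret void
}
```

The elements taken from `%d` are stored back to the location they were loaded from unchanged, which is what makes the low half of the store dead and the whole thing a candidate for a DSE-style narrowing.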
Extended Description
Split off from llvm/llvm-project#40774
https://godbolt.org/z/_n1ggH
```c
void example(__m256i *restrict dest, const __m256i *restrict a) {
  (*dest)[2] = (*a)[2];
  (*dest)[3] = (*a)[3];
}
```
Here we never touch the low half of `dest`, and replace the high half of `dest` with the high half of `a`. The naive asm could be:
```asm
vmovaps ymm0, ymmword ptr [rdi]
vblendps ymm0, ymm0, ymmword ptr [rsi], 240 # ymm0 = ymm0[0,1,2,3],mem[4,5,6,7]
vmovaps ymmword ptr [rdi], ymm0
vzeroupper
```
But we can also produce:
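The snippet that originally followed here appears to have been lost; based on the description above ("2 xmm memops"), it was presumably something like copying only the high 16 bytes directly (register assignments assume the SysV calling convention, `rdi` = `dest`, `rsi` = `a`):

```asm
vmovaps xmm0, xmmword ptr [rsi + 16]
vmovaps xmmword ptr [rdi + 16], xmm0
```

This avoids reading `dest`, avoids the blend, and never touches a ymm register, so no `vzeroupper` is needed either.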
I'm not quite sure what the exact criteria are for when that is profitable to do, though.