llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Trivial loads and stores are too aggressive/naive about using scatter/gather instructions #34518

Open davezarzycki opened 6 years ago

davezarzycki commented 6 years ago
Bugzilla Link 35170
Version trunk
OS All
CC @ayalz, @topperc, @davezarzycki, @hfinkel, @RKSimon

Extended Description

Note: LLVM r317088

When a top-of-tree clang+llvm builds itself with -march=knl, auto-vectorization is way too aggressive/naive about using vector scatter/gather instructions. For example, consider llvm::DenseMap<...>::grow(unsigned int) (code gen below).
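
For reference, here is a minimal sketch (my own illustration, not the actual DenseMap source) of the kind of store loop grow() boils down to: one sentinel key written per bucket, i.e. contiguous 8-byte stores at a 16-byte stride, which needs no gather/scatter at all.

#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for DenseMap's bucket layout: an 8-byte key
// followed by an 8-byte value, so keys sit 16 bytes apart in memory.
struct Bucket {
  uint64_t Key;
  uint64_t Value;
};

// grow() re-initializes every bucket's key to the "empty" sentinel.
// These are strided scalar stores whose addresses fall out of the
// induction variable; nothing here calls for scatter instructions.
void initBuckets(Bucket *Buckets, size_t NumBuckets, uint64_t EmptyKey) {
  for (size_t I = 0; I != NumBuckets; ++I)
    Buckets[I].Key = EmptyKey;
}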

This generates a large loop that moves a ton of scalars and pointers from the scalar domain into the vector domain, packs them into vector registers, and then uses vpscatterqq to store the elements one at a time anyway. If the vector scatter/gather instructions were magically more efficient than their scalar equivalents, then maybe all of the scalar-to-vector copying and packing would be worth it -- but they're not. The scatter/gather instructions exist precisely to avoid the kind of data movement between the scalar and vector domains that the compiler generated here:

1068530: 48 8d 8a 10 fe ff ff lea -0x1f0(%rdx),%rcx
1068537: 48 8d 9a 20 fe ff ff lea -0x1e0(%rdx),%rbx
106853e: 48 8d aa 30 fe ff ff lea -0x1d0(%rdx),%rbp
1068545: 48 8d ba 40 fe ff ff lea -0x1c0(%rdx),%rdi
106854c: 4c 8d 82 50 fe ff ff lea -0x1b0(%rdx),%r8
1068553: 4c 8d b2 60 fe ff ff lea -0x1a0(%rdx),%r14
106855a: 4c 8d 8a 70 fe ff ff lea -0x190(%rdx),%r9
1068561: 4c 8d 92 80 fe ff ff lea -0x180(%rdx),%r10
1068568: c4 c1 f9 6e ca vmovq %r10,%xmm1
106856d: c4 c1 f9 6e d1 vmovq %r9,%xmm2
1068572: c5 e9 6c c9 vpunpcklqdq %xmm1,%xmm2,%xmm1
1068576: c4 c1 f9 6e d6 vmovq %r14,%xmm2
106857b: c4 c1 f9 6e d8 vmovq %r8,%xmm3
1068580: c5 e1 6c d2 vpunpcklqdq %xmm2,%xmm3,%xmm2
1068584: c4 e3 6d 38 c9 01 vinserti128 $0x1,%xmm1,%ymm2,%ymm1
106858a: c4 e1 f9 6e d7 vmovq %rdi,%xmm2
106858f: c4 e1 f9 6e dd vmovq %rbp,%xmm3
1068594: c5 e1 6c d2 vpunpcklqdq %xmm2,%xmm3,%xmm2
1068598: c4 e1 f9 6e db vmovq %rbx,%xmm3
106859d: c4 e1 f9 6e e1 vmovq %rcx,%xmm4
10685a2: c5 d9 6c db vpunpcklqdq %xmm3,%xmm4,%xmm3
10685a6: c4 e3 65 38 d2 01 vinserti128 $0x1,%xmm2,%ymm3,%ymm2
10685ac: 62 f3 ed 48 3a c9 01 vinserti64x4 $0x1,%ymm1,%zmm2,%zmm1
10685b3: 4c 8d 82 90 fe ff ff lea -0x170(%rdx),%r8
10685ba: 4c 8d 8a a0 fe ff ff lea -0x160(%rdx),%r9
10685c1: 4c 8d 92 b0 fe ff ff lea -0x150(%rdx),%r10
10685c8: 4c 8d b2 c0 fe ff ff lea -0x140(%rdx),%r14
10685cf: 48 8d 8a d0 fe ff ff lea -0x130(%rdx),%rcx
10685d6: 48 8d ba e0 fe ff ff lea -0x120(%rdx),%rdi
10685dd: 48 8d aa f0 fe ff ff lea -0x110(%rdx),%rbp
10685e4: 48 8d 9a 00 ff ff ff lea -0x100(%rdx),%rbx
10685eb: c4 e1 f9 6e d3 vmovq %rbx,%xmm2
10685f0: c4 e1 f9 6e dd vmovq %rbp,%xmm3
10685f5: c5 e1 6c d2 vpunpcklqdq %xmm2,%xmm3,%xmm2
10685f9: c4 e1 f9 6e df vmovq %rdi,%xmm3
10685fe: c4 e1 f9 6e e1 vmovq %rcx,%xmm4
1068603: c5 d9 6c db vpunpcklqdq %xmm3,%xmm4,%xmm3
1068607: c4 e3 65 38 d2 01 vinserti128 $0x1,%xmm2,%ymm3,%ymm2
106860d: c4 c1 f9 6e de vmovq %r14,%xmm3
1068612: c4 c1 f9 6e e2 vmovq %r10,%xmm4
1068617: c5 d9 6c db vpunpcklqdq %xmm3,%xmm4,%xmm3
106861b: c4 c1 f9 6e e1 vmovq %r9,%xmm4
1068620: c4 c1 f9 6e e8 vmovq %r8,%xmm5
1068625: c5 d1 6c e4 vpunpcklqdq %xmm4,%xmm5,%xmm4
1068629: c4 e3 5d 38 db 01 vinserti128 $0x1,%xmm3,%ymm4,%ymm3
106862f: 62 f3 e5 48 3a d2 01 vinserti64x4 $0x1,%ymm2,%zmm3,%zmm2
1068636: 4c 8d 82 10 ff ff ff lea -0xf0(%rdx),%r8
106863d: 4c 8d 8a 20 ff ff ff lea -0xe0(%rdx),%r9
1068644: 4c 8d 92 30 ff ff ff lea -0xd0(%rdx),%r10
106864b: 4c 8d b2 40 ff ff ff lea -0xc0(%rdx),%r14
1068652: 48 8d 8a 50 ff ff ff lea -0xb0(%rdx),%rcx
1068659: 48 8d ba 60 ff ff ff lea -0xa0(%rdx),%rdi
1068660: 48 8d aa 70 ff ff ff lea -0x90(%rdx),%rbp
1068667: 48 8d 5a 80 lea -0x80(%rdx),%rbx
106866b: c4 e1 f9 6e db vmovq %rbx,%xmm3
1068670: c4 e1 f9 6e e5 vmovq %rbp,%xmm4
1068675: c5 d9 6c db vpunpcklqdq %xmm3,%xmm4,%xmm3
1068679: c4 e1 f9 6e e7 vmovq %rdi,%xmm4
106867e: c4 e1 f9 6e e9 vmovq %rcx,%xmm5
1068683: c5 d1 6c e4 vpunpcklqdq %xmm4,%xmm5,%xmm4
1068687: c4 e3 5d 38 db 01 vinserti128 $0x1,%xmm3,%ymm4,%ymm3
106868d: c4 c1 f9 6e e6 vmovq %r14,%xmm4
1068692: c4 c1 f9 6e ea vmovq %r10,%xmm5
1068697: c5 d1 6c e4 vpunpcklqdq %xmm4,%xmm5,%xmm4
106869b: c4 c1 f9 6e e9 vmovq %r9,%xmm5
10686a0: c4 c1 f9 6e f0 vmovq %r8,%xmm6
10686a5: c5 c9 6c ed vpunpcklqdq %xmm5,%xmm6,%xmm5
10686a9: c4 e3 55 38 e4 01 vinserti128 $0x1,%xmm4,%ymm5,%ymm4
10686af: 62 f3 dd 48 3a db 01 vinserti64x4 $0x1,%ymm3,%zmm4,%zmm3
10686b6: 4c 8d 42 90 lea -0x70(%rdx),%r8
10686ba: 4c 8d 4a a0 lea -0x60(%rdx),%r9
10686be: 4c 8d 52 b0 lea -0x50(%rdx),%r10
10686c2: 48 8d 5a c0 lea -0x40(%rdx),%rbx
10686c6: 48 8d 4a d0 lea -0x30(%rdx),%rcx
10686ca: 48 8d 7a e0 lea -0x20(%rdx),%rdi
10686ce: 48 8d 6a f0 lea -0x10(%rdx),%rbp
10686d2: c4 e1 f9 6e e2 vmovq %rdx,%xmm4
10686d7: c4 e1 f9 6e ed vmovq %rbp,%xmm5
10686dc: c5 d1 6c e4 vpunpcklqdq %xmm4,%xmm5,%xmm4
10686e0: c4 e1 f9 6e ef vmovq %rdi,%xmm5
10686e5: c4 e1 f9 6e f1 vmovq %rcx,%xmm6
10686ea: c5 c9 6c ed vpunpcklqdq %xmm5,%xmm6,%xmm5
10686ee: c4 e3 55 38 e4 01 vinserti128 $0x1,%xmm4,%ymm5,%ymm4
10686f4: c4 e1 f9 6e eb vmovq %rbx,%xmm5
10686f9: c4 c1 f9 6e f2 vmovq %r10,%xmm6
10686fe: c5 c9 6c ed vpunpcklqdq %xmm5,%xmm6,%xmm5
1068702: c4 c1 f9 6e f1 vmovq %r9,%xmm6
1068707: c4 c1 f9 6e f8 vmovq %r8,%xmm7
106870c: c5 c1 6c f6 vpunpcklqdq %xmm6,%xmm7,%xmm6
1068710: c4 e3 4d 38 ed 01 vinserti128 $0x1,%xmm5,%ymm6,%ymm5
1068716: 62 f3 d5 48 3a e4 01 vinserti64x4 $0x1,%ymm4,%zmm5,%zmm4
106871d: c5 fc 46 c8 kxnorw %k0,%k0,%k1
1068721: 62 f2 fd 49 a1 04 0d vpscatterqq %zmm0,0x0(,%zmm1,1){%k1}
1068728: 00 00 00 00
106872c: c5 fc 46 c8 kxnorw %k0,%k0,%k1
1068730: 62 f2 fd 49 a1 04 15 vpscatterqq %zmm0,0x0(,%zmm2,1){%k1}
1068737: 00 00 00 00
106873b: c5 fc 46 c8 kxnorw %k0,%k0,%k1
106873f: 62 f2 fd 49 a1 04 1d vpscatterqq %zmm0,0x0(,%zmm3,1){%k1}
1068746: 00 00 00 00
106874a: c5 fc 46 c8 kxnorw %k0,%k0,%k1
106874e: 62 f2 fd 49 a1 04 25 vpscatterqq %zmm0,0x0(,%zmm4,1){%k1}
1068755: 00 00 00 00
1068759: 48 81 c2 00 02 00 00 add $0x200,%rdx
1068760: 49 83 c5 e0 add $0xffffffffffffffe0,%r13
1068764: 0f 85 c6 fd ff ff jne 1068530 <_ZN4llvm8DenseMapIPNS_8MCSymbolENS_14PointerIntPairIS2_Lj1EbNS_21PointerLikeTypeTraitsIS2_EENS_18PointerIntPairInfoIS2_Lj1ES5_EEEENS_12DenseMapInfoIS2_EENS_6detail12DenseMapPairIS2_S8_EEE4growEj+0x100>

davezarzycki commented 6 years ago

I took the time to learn how to use 'opt'. It seems like there are two problems here:

1) The loop vectorizer cost estimator assumes that GEPs are free, but in this scenario they compile down to non-free LEAs in the back end.

2) The loop vectorizer doesn't bother cost estimating the 'insertelement' IR instructions it generates, which are non-free in the back end (see the sketch below).
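
To make the arithmetic concrete, here is a toy cost calculation (my own illustration, not the actual LoopVectorize/TTI code) of what a scatter should be charged once its address setup is accounted for:

unsigned scatterCostWithSetup(unsigned VF, unsigned ScatterCost,
                              unsigned GEPCost, unsigned InsertEltCost) {
  // Charge one address computation (GEP -> LEA) per lane...
  unsigned AddrSetup = VF * GEPCost;
  // ...plus the packing (vmovq/vpunpcklqdq/vinserti*) that builds the
  // pointer vector, one insertelement per lane at the IR level.
  unsigned PackCost = VF * InsertEltCost;
  return ScatterCost + AddrSetup + PackCost;
}

With VF = 8, the free-GEP/free-insertelement assumption hides roughly two instructions per lane of setup for each vpscatterqq in the dump above.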

davezarzycki commented 6 years ago

Thanks Hal,

FWIW, I only meant that -fno-unroll-loops mostly de-bloats the auto-vectorization. The remaining vectorized code is still bad though.

As for -fno-vectorize and -fno-slp-vectorize not working consistently, I suspect that parts of the LLVM build system are dropping CXXFLAGS / CFLAGS on the floor.

hfinkel commented 6 years ago

> I'll be doing some Xeon Skylake testing soon. Is there a way to prevent the auto-vectorizer from using the AVX512 scatter/gather instructions without disabling all of AVX512?

I believe that Craig is currently working on things related to this.

hfinkel commented 6 years ago

> I think I figured out the bug.
>
> It seems that scalar loop unrolling is happening before auto-vectorization, rather than afterwards.

We should only be doing full loop unrolling (i.e., loop unrolling that completely removes the loop) before vectorization. That's part of our canonicalization process. Partial unrolling should happen later.
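
A quick illustration of the distinction (my own example, not code from the report):

// Full unrolling: the constant trip count lets the loop disappear
// entirely, which is a reasonable canonicalization before vectorization.
void fullUnrollCandidate(int *A) {
  for (int I = 0; I != 4; ++I)
    A[I] += 1;
}

// Partial unrolling: an unknown trip count means the loop survives with a
// widened body, so this should wait until after the vectorizers have run.
void partialUnrollCandidate(int *A, int N) {
  for (int I = 0; I != N; ++I)
    A[I] += 1;
}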

> The problem with this approach is that reasonable scalar loop unrolling can generate code that is unreasonable to auto-vectorize. For example, a lot of C++ code uses std::pair or pair-like data structures, which tends to generate a lot of interleaved mutation patterns in loops that auto-vectorize poorly.
>
> It would be better for auto-vectorization to try a loop first and then fall back to scalar loop unrolling if the auto-vectorizer gives up.

That's how it works, or at least how it's supposed to work, and if it's not, we should figure out why. We should run:

  1. Loop vectorization
  2. SLP vectorization
  3. Partial unrolling

(and I believe that's what the code in lib/Transforms/IPO/PassManagerBuilder.cpp does with the default options)
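
In sketch form, that ordering looks like the following (loosely modeled on the legacy pass-manager setup; signatures simplified, not the exact upstream code):

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Vectorize.h"

void addLateVectorizationPasses(llvm::legacy::PassManagerBase &MPM) {
  MPM.add(llvm::createLoopVectorizePass()); // 1. loop vectorization
  MPM.add(llvm::createSLPVectorizerPass()); // 2. SLP vectorization
  MPM.add(llvm::createLoopUnrollPass());    // 3. partial unrolling last
}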

> Additionally, at least when building LLVM+clang top-of-tree, it seems like -fno-vectorize and -fno-slp-vectorize don't reliably disable all auto-vectorization.

Aside from, potentially, simple memory-access combining in the backend, I don't know why this might be.

davezarzycki commented 6 years ago

I think I figured out the bug.

It seems that scalar loop unrolling is happening before auto-vectorization, rather than afterwards. The problem with this approach is that reasonable scalar loop unrolling can generate code that is unreasonable to auto-vectorize. For example, a lot of C++ code uses std::pair or pair-like data structures, which tends to generate a lot of interleaved mutation patterns in loops that auto-vectorize poorly.

It would be better for auto-vectorization to try a loop first and then fall back to scalar loop unrolling if the auto-vectorizer gives up.

Additionally, at least when building LLVM+clang top-of-tree, it seems like -fno-vectorize and -fno-slp-vectorize don't reliably disable all auto-vectorization.

davezarzycki commented 6 years ago

I did some triage of this bug. Unless I'm confusing my test results, building with -fno-vectorize and -fno-slp-vectorize still generated the terrible code.

I don't know what the feature name is for vectorizing trivial loads and stores, but that seems to be the problem. The following hack disabled the terrible code gen while still letting the auto-vectorizer use scatter/gather instructions elsewhere:

diff --git i/lib/Target/X86/X86TargetTransformInfo.cpp w/lib/Target/X86/X86TargetTransformInfo.cpp
index 967d67a84bc..90b6e71b4a7 100644
--- i/lib/Target/X86/X86TargetTransformInfo.cpp
+++ w/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2462,6 +2462,7 @@ int X86TTIImpl::getGatherScatterOpCost(unsigned Opcode, Type *SrcVTy,
   // is better in the VariableMask case.
   if (ST->hasAVX512() && (VF == 2 || (VF == 4 && !ST->hasVLX())))
     Scalarize = true;

davezarzycki commented 6 years ago

I'll be doing some Xeon Skylake testing soon. Is there a way to prevent the auto-vectorizer from using the AVX512 scatter/gather instructions without disabling all of AVX512?

davezarzycki commented 6 years ago

As I suspected, changing DenseMap to memset() the empty value during growth ended the auto-vectorizer indigestion:

106c540: 62 f1 fe 48 7f 02 vmovdqu64 %zmm0,(%rdx)
106c546: 62 f1 fe 48 7f 42 01 vmovdqu64 %zmm0,0x40(%rdx)
106c54d: 48 83 ea 80 sub $0xffffffffffffff80,%rdx
106c551: 48 39 c2 cmp %rax,%rdx
106c554: 75 ea jne 106c540 <_ZN4llvm8DenseMapIPNS_8MCSymbolENS_14PointerIntPairIS2_Lj1EbNS_21PointerLikeTypeTraitsIS2_EENS_18PointerIntPairInfoIS2_Lj1ES5_EEEENS_12DenseMapInfoIS2_EENS_6detail12DenseMapPairIS2_S8_EEE4growEj+0x110>
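
For illustration, a sketch (my reconstruction, not the actual DenseMap patch) of why dense initialization vectorizes cleanly: once every 8-byte slot in the bucket array is written -- values as well as keys -- the stores are contiguous rather than strided, and the vectorizer can emit the plain 64-byte vmovdqu64 stores above instead of scatters.

#include <cstddef>
#include <cstdint>

// Hypothetical helper: treat the bucket array as a flat run of 8-byte
// slots (two per bucket) and fill them all with the sentinel pattern.
// Contiguous stores like this lower to wide unmasked vector stores.
void initBucketsDense(uint64_t *Slots, size_t NumSlots, uint64_t EmptyKey) {
  for (size_t I = 0; I != NumSlots; ++I)
    Slots[I] = EmptyKey;
}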

davezarzycki commented 6 years ago

Note: the size of the loop body would shrink dramatically if the "add $0x200,%rdx" pointer increment were vectorized, i.e. if the pointer vectors were kept in registers and advanced with a vector add instead of being rebuilt from scalar LEAs every iteration. Nevertheless, the original observation still stands: vector scatter/gather isn't magically faster than the original scalar stores/loads.
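
To illustrate, a hand-written sketch (my own illustration, assuming AVX-512F intrinsics, the 16-byte bucket stride above, and a bucket count that is a multiple of 8) that keeps the index vector in a register and advances it with one vector add per scatter instead of rebuilding it from scalar LEAs:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scatter EmptyKey into every other 8-byte slot (the keys). The index
// vector lives in a zmm register across iterations; advancing it costs a
// single vpaddq rather than eight LEAs plus seven packing instructions.
void scatterFillKeys(uint64_t *Slots, size_t NumBuckets, uint64_t EmptyKey) {
  __m512i Idx = _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14);
  const __m512i Step = _mm512_set1_epi64(16); // 8 buckets * 2 slots each
  const __m512i Val = _mm512_set1_epi64(EmptyKey);
  for (size_t I = 0; I != NumBuckets; I += 8) {
    _mm512_i64scatter_epi64(Slots, Idx, Val, 8); // scale = 8 bytes/slot
    Idx = _mm512_add_epi64(Idx, Step);
  }
}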