non-__restrict qualified xor_8regs_2()
Oh, we already have those, they're called xor_arm4regs_2().
So today we have:
xor_neon_2()
-> kernel_neon_begin()
-> xor_8regs_2()
   -> if GCC's versioning distance check passes: vectorized (NEON) loop
   -> else: scalar loop
-> kernel_neon_end()
But if we moved the distance check to C and dropped -ftree-vectorize, we could do:
xor_neon_2()
-> if distance check explicitly written in C:
   -> then:
      -> kernel_neon_begin()
      -> __restrict qualified xor_8regs_2()
      -> kernel_neon_end()
   -> else:
      -> non-__restrict qualified xor_8regs_2()
And it should be roughly the same size (modulo calling convention overhead), since the __restrict qualified xor_8regs_2() and the non-__restrict qualified xor_8regs_2() together should be roughly similar in size to the dual-versioned function GCC emits.
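A rough sketch of that shape (illustrative only: the variant names are made up, kernel_neon_begin()/kernel_neon_end() come from the kernel's <asm/neon.h>, and this assumes the one-chunk distance check is a sufficient guard, which is questioned further down in this bug):

#include <stdint.h>	/* uintptr_t; kernel code would use its own types */

void kernel_neon_begin(void);	/* from <asm/neon.h> in kernel builds */
void kernel_neon_end(void);

void xor_8regs_2_scalar(unsigned long bytes,
			unsigned long *p1, unsigned long *p2);

/* __restrict variant: the only path that can be vectorized, so the
 * kernel_neon_begin()/kernel_neon_end() calls are sunk into it and
 * are only paid when NEON registers can actually be used. */
void xor_8regs_2_vec(unsigned long bytes,
		     unsigned long *__restrict p1,
		     unsigned long *__restrict p2)
{
	long lines = bytes / sizeof(long) / 8;

	kernel_neon_begin();
	do {
		p1[0] ^= p2[0]; p1[1] ^= p2[1];
		p1[2] ^= p2[2]; p1[3] ^= p2[3];
		p1[4] ^= p2[4]; p1[5] ^= p2[5];
		p1[6] ^= p2[6]; p1[7] ^= p2[7];
		p1 += 8; p2 += 8;
	} while (--lines > 0);
	kernel_neon_end();
}

void xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
	uintptr_t a = (uintptr_t)p1, b = (uintptr_t)p2;
	uintptr_t dist = a > b ? a - b : b - a;

	/* The distance check GCC's loop versioning would have emitted,
	 * written out in C: one 8-long chunk in either direction. */
	if (dist >= 8 * sizeof(long))
		xor_8regs_2_vec(bytes, p1, p2);
	else
		xor_8regs_2_scalar(bytes, p1, p2);
}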
Thanks Dave for the info.
SLP does not have the ability to add runtime checks in llvm
So it sounds like LLVM SLP cannot "version" the loop, like GCC does?
Adding restrict would probably be my suggested way to fix it to be honest, so long as it's valid to use.
That's my main concern; none of this code looks like it can provide guarantees in the caller without excessive churn. Every struct xor_block_template (~23 of them)
would need to be modified to use __restrict in the definition, and even if the arm32 call site alone could guarantee no overlap of less than 8B, I'm not sure all the other callers in the kernel could as well.
And even if we put a guard in before calling the restrict version, what do we do if we fail the guard (i.e. the distance between the two pointers is < 8)? Call the same function definition, minus the __restrict-qualified operands? That's going to duplicate a lot of code for little gain (even if the definitions are wrapped in a macro and stamped out). Then going through the effort to preserve NEON registers on context switch would be a waste, which is Arnd's concern.
We'd basically be coding up the distance check (that GCC emits when it versions the loop) in C, though that might actually be an optimization:
Arnd, consider the case where xor_neon_2() is called, but the pointers passed may overlap. kernel_neon_begin() will always enable preservation of the neon registers, even though they might not even end up being used (remember, GCC versioned the loop, so there's a vectorized part using NEON gated on a runtime check). If we moved the distance check into C, then we could "lower" the calls to kernel_neon_begin/kernel_neon_end to only run on the code path that actually used the neon registers. (if that makes sense)
In other words, we'd manually make 2 versions of xor_8regs_2 and friends: one with restrict qualifiers AND the calls to kernel_neon_begin/end sunk into it; add a pointer distance check in xor_neon_2(); and then finally remove -ftree-vectorize (since we'd want one version that uses __restrict to provide only a vectorized loop, and one version that is scalar; not one function that is both). If you left -ftree-vectorize in, then with GCC you'd take a penalty, since the non-restrict version would get two versions of the loop and not preserve NEON registers correctly. Am I missing something with this approach?
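And for the "wrapped in a macro and stamped out" duplication, a hypothetical shape (names are illustrative; the empty qualifier argument relies on a long-standing GNU/C99 preprocessor behavior):

/* Stamp out both variants from one body. */
#define DEFINE_XOR_8REGS_2(name, quals)				\
static void name(unsigned long bytes,				\
		 unsigned long *quals p1,			\
		 unsigned long *quals p2)			\
{								\
	long lines = bytes / sizeof(long) / 8;			\
	do {							\
		p1[0] ^= p2[0]; p1[1] ^= p2[1];			\
		p1[2] ^= p2[2]; p1[3] ^= p2[3];			\
		p1[4] ^= p2[4]; p1[5] ^= p2[5];			\
		p1[6] ^= p2[6]; p1[7] ^= p2[7];			\
		p1 += 8; p2 += 8;				\
	} while (--lines > 0);					\
}

DEFINE_XOR_8REGS_2(xor_8regs_2_scalar, )		/* plain pointers */
DEFINE_XOR_8REGS_2(xor_8regs_2_restrict, __restrict)	/* vectorizable */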
Hello
LLVM has two vectorizers - a loop vectorizer controlled by the #pragma clang loop directives, and an SLP vectorizer that turns sequences of scalar instructions into equivalent vector instructions.
GCC is (essentially) performing (loop-aware) SLP vectorization, and that is what adding restrict helps with. SLP does not have the ability to add runtime checks in llvm, so if there is no restrict it cannot legally turn the instructions into vector variants without risking a change in behaviour.
That is what this example shows: https://godbolt.org/z/YnfWWT You can see that from the 16 loads, 8 xors and 8 stores we end up with 4 vector loads, 2 xors and 2 stores, each handling 4 lanes. It's turned the scalar instructions into vector instructions directly. Adding restrict would probably be my suggested way to fix it, to be honest, so long as it's valid to use. The compiler will not need to use runtime checks and should produce efficient code.
The loop vectorizer in llvm works a little differently. For each instruction in the input loop it will try to produce a vector instruction, so an input with 16 loads will produce 16 vector loads. And because, in this case, they don't access consecutive values on each iteration, they create what is called an interleave group. These are like the VLD2/VLD3/VLD4 instructions in NEON, loading into interleaved lanes from the input. This shows that happening for an interleave size of 4: https://godbolt.org/z/PbsYK8
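(An interleave group of 4 would come from a body shaped something like the sketch below, where each per-iteration access has stride 4 across iterations and so maps onto VLD4/VST4 - presumably roughly what that second godbolt link demonstrates.)

/* Hypothetical stride-4 analogue of xor_8regs_2: across iterations,
 * p1[0] touches p1[0], p1[4], p1[8], ... so vectorizing it by 4 needs
 * an interleave group of width 4 (VLD4), which NEON does have. */
void xor_4regs_2(unsigned long bytes,
		 unsigned long *__restrict p1,
		 unsigned long *__restrict p2)
{
	long lines = bytes / sizeof(long) / 4;
	do {
		p1[0] ^= p2[0]; p1[1] ^= p2[1];
		p1[2] ^= p2[2]; p1[3] ^= p2[3];
		p1 += 4; p2 += 4;
	} while (--lines > 0);
}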
Unfortunately there is no "VLD8" though, so we give interleave groups that large a high cost. In this case, because the interleaving load and the interleaving store can "cancel out", we could be producing contiguous loads instead. The interleaving loads and stores will generally not be as efficient as contiguous loads, so that would work out well here. This requires a more global view of the operations to prove it is not costly, though; it's maybe something that vplan could help with in the future.
But I believe that is what is happening when the vector factor is forced, as in https://godbolt.org/z/697W5G. There are lots of interleaving loads and stores created, which do not end up producing VLDn instructions, but then the "interleavings" (technically vector shuffles) cancel out and we are left with 16 vector loads, 8 vector xors and 8 vector stores.
It will depend on whether this high vector factor is beneficial, but I would expect adding restrict to get the SLP vectorization to be the cleanest option - and, from the look of it, the closest to what the original authors intended, given the GCC codegen.
More discussion upstream: https://lore.kernel.org/lkml/20201106051436.2384842-3-adrian.ratiu@collabora.com/
I'm inclined to close this bug. Adding a #pragma clang loop vectorize directive to the loop allows Clang to proceed with auto-vectorization, overruling the cost model.
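Presumably something along these lines (the exact directive is my reconstruction; vectorize(enable) makes the loop vectorizer proceed despite the cost model, while vectorize_width(N) would force a specific factor, as in the forced-VF godbolt link above):

void xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
	long lines = bytes / (sizeof (long)) / 8;

	/* Override the "not beneficial" cost-model verdict; the loop
	 * vectorizer still emits its own runtime alias checks. */
#pragma clang loop vectorize(enable)
	do {
		p1[0] ^= p2[0]; p1[1] ^= p2[1];
		p1[2] ^= p2[2]; p1[3] ^= p2[3];
		p1[4] ^= p2[4]; p1[5] ^= p2[5];
		p1[6] ^= p2[6]; p1[7] ^= p2[7];
		p1 += 8; p2 += 8;
	} while (--lines > 0);
}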
LKML thread: https://lore.kernel.org/lkml/20201106051436.2384842-3-adrian.ratiu@collabora.com/
I modified the example enough to build it for x86, and this verifies that the arm32 behavior is exactly the same as on x86: with restrict, it uses vector registers; without restrict, it produces completely scalar code:
As far as I can tell, arm32 is the only architecture that uses the generic asm/xor.h with autovectorization; all other architectures (including arm64) provide an architecture-specific implementation of asm/xor.h instead.
[linux: add __restrict in asm-generic](https://user-images.githubusercontent.com/92601431/143758831-63b07a51-d952-4644-a299-54b66304285e.gz)

Based on the observation in comment #2, I made a patch that adds __restrict (which, unlike restrict, is allowed in gnu89) to all prototypes in include/asm-generic/xor.h.
I can confirm that this fixes the problem with clang-8.
would a logical shift right set the carry clear flag?).
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/Cihfddaf.html
add r3, r1, #32
add ip, r2, #32
if (((uintptr_t)(p1+1) - (uintptr_t)p2) == 0)
    if (((uintptr_t)(p2+1) - (uintptr_t)p1) == 0)
        goto conservative_unvectorized;
"compare p2 to (p1+1), if they're the same compare p1 to (p2+1), if they're the same then the regions potentially overlap, so do the load, xor (eor), store."
Sorry, should be (p1+8) and (p2+8), so the check is that the pointers are 8 elements apart for the vld1.32/veor/vst1.32 pairs. Looks like arm is ilp32, so sizeof(long) == 4.
And the check is not against zero, http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/CJAJIHAD.html indicates CC stands for Carry Clear (not Condition Code as I had assumed).
and the ... I left out is an lsr against r0 (would a logical shift right set the carry clear flag?). I'm trying to think through whether that's a proper guard for p1 and p2 being at least bytes apart...
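Putting the corrected reading together, the emitted guard is presumably equivalent to this C (sketch; .L2 assumed to be the conservative path):

/* r3 = p1 + 8, ip = p2 + 8 (8 longs == 32 bytes when sizeof(long) == 4).
 * cmp r2, r3 / cmpcc r1, ip / bcc: branch only when BOTH unsigned
 * compares come out lower, i.e. the two 32-byte chunks could overlap. */
if ((uintptr_t)p2 < (uintptr_t)(p1 + 8) &&
    (uintptr_t)p1 < (uintptr_t)(p2 + 8))
	goto conservative_unvectorized;	/* .L2 */
/* otherwise fall through to the vectorized loop */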
Maybe another point is that not only could the pointers alias, but there doesn't seem to be any guarantee that the pointed-to memory regions do not overlap. This feels vaguely reminiscent of memcpy vs memmove.
Also, looking at Arnd's example, gcc produces essentially 2 versions of the function xor_8regs_2. Basic block .L3 is a vectorized version, while .L5 is the non-vectorized version. So how does the code emitted by GCC decide at runtime which path to take?
@ xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
@ r0: bytes
@ r1: p1
@ r2: p2
xor_8regs_2:
add r3, r1, #32
add ip, r2, #32
cmp r2, r3
cmpcc r1, ip
...
bcc .L2
Tea-leaves interpretation:
if (((uintptr_t)(p1+1) - (uintptr_t)p2) == 0)
    if (((uintptr_t)(p2+1) - (uintptr_t)p1) == 0)
        goto conservative_unvectorized;
"compare p2 to (p1+1), if they're the same compare p1 to (p2+1), if they're the same then the regions potentially overlap, so do the load, xor (eor), store."
Which looks like some kind of check on the distance between the pointers, based on the size of their pointed-to type.
So I think it's a nice optimization, but I think it's incorrect, because it's not safe to optimize unless the distance between the two pointers is at least bytes. But I would appreciate someone triple-checking this interpretation.
The RAID code in the Linux kernel is written in a way that should let the compiler autovectorize it with good results.
I'm not certain C offers such guarantees. If you wanted assembly X, you should have written X in assembly. Then you can break from the C ABI and require invariants in the caller like "p1 and p2 not only don't alias, but don't overlap, and it's on you if you mess up."
(If p1 == p2, xor'ing p1 with p2 could have been lowered to memset(p1, 0, bytes).)
Alias analysis strikes again. (The -Rpass-missed=.* output complains about GVN, which I think is the giveaway that aliasing is related.) If I qualify the function's pointer parameters as restrict, LLVM will vectorize the case Kristof provided. (Adding -fno-strict-aliasing, as the Linux kernel does, makes no difference.)
Unfortunately, the kernel also uses -std=gnu89, where the restrict keyword is unrecognized.
With -fno-strict-aliasing added, the function attributes in LLVM IR (still) contain:
attributes #0 = { ... "target-features"="...,+strict-align,..." ... }
and it doesn't seem to affect the generation of noalias attributes on the function parameters.
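(For contrast, with restrict-qualified parameters clang does mark the parameters noalias in the IR; a hand-written illustration of roughly what the clang-8-era signature would look like:)

define void @xor_8regs_2(i32 %bytes, i32* noalias %p1, i32* noalias %p2)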
The function in question can't be inlined, as IIRC it's referenced via function pointer in the kernel.
Richard, are there any other ways to denote that two function parameters of the same qualified type do not alias, and does -fno-strict-aliasing have implications on such annotations in LLVM IR?
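One thing that does work under -std=gnu89: both GCC and Clang accept the double-underscore spelling as an extension, which is what the asm-generic patch attached in this bug ends up using:

/* gnu89-compatible: __restrict instead of C99's restrict */
void xor_8regs_2(unsigned long bytes,
		 unsigned long *__restrict p1,
		 unsigned long *__restrict p2);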
The listing pointed to on godbolt is rather long. I thought I'd just extract what seems like a good representative of the problem here:
$ cat t.c
void
xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
	long lines = bytes / (sizeof (long)) / 8;

	do {
		p1[0] ^= p2[0];
		p1[1] ^= p2[1];
		p1[2] ^= p2[2];
		p1[3] ^= p2[3];
		p1[4] ^= p2[4];
		p1[5] ^= p2[5];
		p1[6] ^= p2[6];
		p1[7] ^= p2[7];
		p1 += 8;
		p2 += 8;
	} while (--lines > 0);
}
$ /work/kristof/build-llvm/bin/clang --target=arm-linux-gnueabi -O2 -std=gnu89 -mfpu=neon -Wno-unused-value -mfloat-abi=softfp -ftree-vectorize t.c -S -Rpass-missed=.*
...
t.c:6:2: remark: the cost-model indicates that vectorization is not beneficial [-Rpass-missed=loop-vectorize]
        do {
        ^
t.c:6:2: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-missed=loop-vectorize]
t.c:2:1: remark: List vectorization was possible but not beneficial with cost 0 >= 0 [-Rpass-missed=slp-vectorizer]
xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
^
So, it does seem like this may be a cost model issue.
Extended Description
The RAID code in the Linux kernel is written in a way that should let the compiler autovectorize it with good results. When compiling the kernel with clang-8, this does not happen; only scalar code is emitted.
I have reproduced this with the attached file; for comparison, the same file at https://godbolt.org/z/7bMthX shows the difference between the gcc and the clang output.