
Clang/LLVM won't generate avx-512 moves even when using intrinsic #68311

Open danilaml opened 1 year ago

danilaml commented 1 year ago

For the following C/C++ code:

#include <immintrin.h>

typedef long long8 __attribute__((vector_size(8 * sizeof(long))));

void bar(long8 *a) {
  long8 val = {-1, -1, -1, -1, -1, -1, -1, -1};
  a[0] = val;
  a[1] = val;
  return;
}

void baz(long8 *a) {
  long8 val = {-1, -1, -1, -1, -1, -1, -1, -1};
  _mm512_store_epi64(a, val);
  _mm512_store_epi64(a+1, val); // comment to generate zmm move
  return;
}

Clang with -O3 -mcpu=icelake-server -force-vector-width=512 -print-after-all -debug would generate the following assembly:

bar(long __vector(8)*):                           # @bar(long __vector(8)*)
        vpcmpeqd        %ymm0, %ymm0, %ymm0
        vmovdqa %ymm0, 96(%rdi)
        vmovdqa %ymm0, 64(%rdi)
        vmovdqa %ymm0, 32(%rdi)
        vmovdqa %ymm0, (%rdi)
        vzeroupper
        retq
baz(long __vector(8)*):                           # @baz(long __vector(8)*)
        vpcmpeqd        %ymm0, %ymm0, %ymm0
        vmovdqa %ymm0, 96(%rdi)
        vmovdqa %ymm0, 64(%rdi)
        vmovdqa %ymm0, 32(%rdi)
        vmovdqa %ymm0, (%rdi)
        vzeroupper
        retq

Godbolt link: https://godbolt.org/z/7fTPjfff8

But if you comment out the second store in baz, the zmm store is generated. It looks like when there are several consecutive stores, they get folded into a memset intrinsic in IR, which is then expanded during SelectionDAG creation using the preferred vector width rather than the actually requested one. But even ignoring that, I'm pretty sure that using zmm stores in this case is more efficient than ymm even on targets that prefer 256-bit vectors, especially if the destination is aligned to 64 bytes.
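Roughly, after MemCpyOpt the body of bar looks like this (a sketch; the exact attributes may differ):

; the two <8 x i64> all-ones stores are folded into a single 128-byte memset,
; which SelectionDAG later expands using the *preferred* vector width
define void @bar(ptr %a) {
  call void @llvm.memset.p0.i64(ptr align 64 %a, i8 -1, i64 128, i1 false)
  ret void
}

declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg)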

Somewhat related to https://github.com/llvm/llvm-project/issues/42585

Perhaps changing this line to Subtarget.useAVX512Regs()? https://github.com/llvm/llvm-project/blob/ca611affd3e5dfe00e6ebe0488994bf93c2d135c/llvm/lib/Target/X86/X86ISelLoweringCall.cpp#L285

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-x86

phoebewang commented 1 year ago

But even ignoring that, I'm pretty sure that using zmm stores in this case is more efficient than ymm even on targets that prefer 256-bit vectors, especially if the destination is aligned to 64 bytes.

We support prefer-vector-width at function granularity. That means we cannot selectively generate ZMM instructions for some instructions where it is efficient and YMM for the rest.

danilaml commented 1 year ago

@phoebewang This attribute doesn't work with inlining (also, clang's min_vector_width appears to be ignored completely in my tests, see https://github.com/llvm/llvm-project/issues/60946, but that's a separate issue; I'm more interested in the IR). As far as I'm aware, only "min-legal-vector-width" is supposed to be properly supported by inlining, but that property is ignored by the linked check in call lowering. See: https://godbolt.org/z/E4Wx8chqG

@topperc could perhaps explain this better, since they added this attribute in https://reviews.llvm.org/D42724, and help me understand how it is meant to work.

phoebewang commented 1 year ago

We have some intrinsics like llvm.memset that have special handling in SelectionDAGBuilder. For llvm.memset, it allows the target code to determine the best vector type. It is much like #60946 in that optimizations (this time in the backend) honor prefer_vector_width and ignore min_vector_width. The behavior meets the expectation of the design. As I explained in #60946, min_vector_width is not designed to override prefer_vector_width. It only takes effect during ISelLowering when 1) we have 512-bit vector types in the IR/nodes and 2) min_vector_width > prefer_vector_width. In this case, we build 256-bit nodes at the beginning of ISel, so we don't have a chance to lower them to ZMM instructions.
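To illustrate (a sketch; the function name is made up and an AVX-512 target such as icelake-server is assumed): the explicit 512-bit store below satisfies both conditions and stays a ZMM store, while the memset carries no 512-bit type, so SelectionDAGBuilder expands it with the preferred (256-bit) width from the start:

define void @mixed(ptr %a) "min-legal-vector-width"="512" "prefer-vector-width"="256" {
  ; a 512-bit vector type is present in the IR -> kept legal, becomes a ZMM store
  store <8 x i64> <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>, ptr %a, align 64
  ; no 512-bit type here -> expanded with the preferred width (YMM stores)
  %b = getelementptr inbounds i8, ptr %a, i64 64
  call void @llvm.memset.p0.i64(ptr align 64 %b, i8 -1, i64 64, i1 false)
  ret void
}

declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg)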

danilaml commented 1 year ago

@phoebewang that seems wrong. What is the purpose of min-vector-width when using AVX-512 intrinsics? Why does the frontend annotate the function with it if it's just ignored? I don't think it should override prefer_vector_width; I think it should override the check in ISelLowering. Because at the moment the codegen is really inconsistent (use one intrinsic - get a zmm move, use two - get 4 ymm moves). Is there any case where expanding memset (or other mem intrinsics) using zmm would be slower than ymm? Especially when specifically asked for, via a check of Subtarget.useAVX512Regs()?

At the moment llvm just generates noticeably slower code (up to 20% in my benchmarks) and there is no way to nudge it into using better instructions, short of using inline asm (or disabling MemCpyOpt, or applying some code hacks).

phoebewang commented 1 year ago

What is the purpose of min-vector-width when using AVX-512 intrinsics? Why does the frontend annotate the function with it if it's just ignored?

The short answer: for the ABI. Passing/returning 512-bit vector arguments must go through ZMM registers. Other cases are out of the scope of the design.
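A minimal IR sketch of the ABI point (the function name is made up; an AVX-512 target is assumed):

; the <8 x i64> argument and return value are passed in %zmm0 per the ABI,
; which requires 512-bit registers to be legal in this function; that is
; what "min-legal-vector-width"="512" guarantees even when the preferred
; width is only 256
define <8 x i64> @pass_through(<8 x i64> %v) "min-legal-vector-width"="512" "prefer-vector-width"="256" {
  ret <8 x i64> %v
}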

I don't think it should override prefer_vector_width; I think it should override the check in ISelLowering. Because at the moment the codegen is really inconsistent (use one intrinsic - get a zmm move, use two - get 4 ymm moves).

The root difference comes from the front end: https://godbolt.org/z/bjnoz7aG1. As I explained above, we have more chances to optimize llvm.memset.

Is there any case where expanding memset (or other mem intrinsics) using zmm would be slower than ymm? Especially when specifically asked for, via a check of Subtarget.useAVX512Regs()?

I don't have an answer for that without measuring on physical machines. You are welcome to improve it if you have strong data to support it. Note, we need to consider different targets to avoid regressions. If using ZMM registers is always a win, we can simply remove this condition rather than relying on min_vector_width.

At the moment llvm just generates noticeably slower code (up to 20% in my benchmarks) and there is no way to nudge it into using better instructions, short of using inline asm (or disabling MemCpyOpt, or applying some code hacks).

The simple way is to set "prefer-vector-width"="512" for the specific function. It is the typical scenario where you prefer 512 to 256 :)
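For example (a sketch of the IR from the issue, with the attribute added):

; with "prefer-vector-width"="512" the folded memset is expanded with ZMM
; stores in this function only; other functions keep their 256-bit preference
define void @bar(ptr %a) "prefer-vector-width"="512" {
  store <8 x i64> <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>, ptr %a, align 64
  %a1 = getelementptr inbounds i8, ptr %a, i64 64
  store <8 x i64> <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>, ptr %a1, align 64
  ret void
}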

danilaml commented 1 year ago

@phoebewang if it's "just" for the ABI, then why are AVX-512 intrinsics annotated with this attribute?

The root difference comes from the front end: https://godbolt.org/z/bjnoz7aG1. As I explained above, we have more chances to optimize llvm.memset.

Not sure where you are getting "frontend" from. Folding consecutive stores into memset is a backend optimization. Have you looked at my second godbolt link? There is no frontend there, just pure IR: https://godbolt.org/z/E4Wx8chqG (it runs clang to show the asm, but you can just switch to opt and see that it folds the stores into a memset that will be expanded via ymm)

If using ZMM registers is always a win, we can simply remove this condition rather than relying on min_vector_width.

At least it's always a win for sizes from 64 bytes up to some threshold (tested up to about 16 KB), and with sufficiently aligned data. (I didn't check the unaligned case, since the function I benchmarked has a pre-loop to align the pointers.)

The simple way is to set "prefer-vector-width"="512" for the specific function. It is the typical scenario where you prefer 512 to 256 :)

This doesn't work, as demonstrated by the godbolt link above. Only min-legal-vector-width survives inlining. "prefer-vector-width" really only works as a global option, or when you are writing application code maybe. You can't use that for a library function (or at least you are not in control of the code that is going to call it). Not to mention that there doesn't seem to be a way to do that in clang, so someone trying to use AVX-512 intrinsics still can't make the compiler generate the intended asm without relying on hacks, like making a separate TU with different options (and good luck if you need to mix functions with different prefer-vector-width attributes). Although I'm more concerned about the IR.
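A sketch of what I mean (function names made up; this is how I read the godbolt link - only "min-legal-vector-width" is merged into the caller, while the callee's "prefer-vector-width"="512" is simply dropped):

; helper with the attributes set as suggested
define internal void @helper(ptr %p) alwaysinline "min-legal-vector-width"="512" "prefer-vector-width"="512" {
  store <8 x i64> <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>, ptr %p, align 64
  ret void
}

; a caller I don't control; after inlining it ends up with
; "min-legal-vector-width"="512" but without "prefer-vector-width"="512",
; so the folded memset is still expanded with YMM stores
define void @user(ptr %p) "min-legal-vector-width"="0" {
  %p1 = getelementptr inbounds i8, ptr %p, i64 64
  call void @helper(ptr %p)
  call void @helper(ptr %p1)
  ret void
}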

phoebewang commented 1 year ago

@phoebewang if it's "just" for the ABI, then why are AVX-512 intrinsics annotated with this attribute?

Because all of these intrinsics have at least one 512-bit argument.

Not sure where you are getting "frontend" from. Folding consecutive stores into memset is a backend optimization.

Ok, my fault. I sometimes treat passes before "-emit-llvm" as "frontend". Actually it's a middle-end pass, "MemCpyOpt". Anyway, we can't call it a backend optimization. Or, just like me, maybe you treat all passes after the frontend as backend :)

Have you looked at my second godbolt link? There is no frontend there, just pure IR: https://godbolt.org/z/E4Wx8chqG (it runs clang to show the asm, but you can just switch to opt and see that it folds the stores into a memset that will be expanded via ymm)

IR has the same problem as the C code, i.e., the "MemCpyOptPass" conditionally turns store instructions into memset. That's the root cause of the inconsistency you observed.

At least it's always a win for sizes from 64 bytes up to some threshold (tested up to about 16 KB).

What target are you testing on? My concern is it might win on the latest hardware but lose on older ones.

This doesn't work, as demonstrated by the godbolt link above. Only min-legal-vector-width survives inlining.

It's just a difference in merging strategy between "min-legal-vector-width" and "prefer-vector-width". "min-legal-vector-width" is not designed for optimization and should not be used for it.

"prefer-vector-width" really only works as a global option, or when you are writing application code maybe.

I don't get the point. "prefer-vector-width" is a function attribute; why do you think it's a global option? You are free to manipulate it in the IR. Or do you mean the frontend lacks attribute support like min_vector_width(512) for C/C++ code?

You can't use that for a library function (or at least you are not in control of the code that is going to call it).

Isn't a library function always in a different TU from its caller, and vice versa? I don't understand it in this context. Shouldn't it happen to work even if it's "a global option"?

Not to mention that there doesn't seem to be a way to do that in clang

This can be solved by adding a prefer_vector_width(N) attribute in the frontend.

so someone trying to use AVX-512 intrinsics still can't make the compiler generate the intended asm without relying on hacks, like making a separate TU with different options (and good luck if you need to mix functions with different prefer-vector-width attributes)

I don't see any risk in mixing functions with different prefer-vector-width attributes. Generating YMM instructions or ZMM ones in a function body won't affect correctness. It's different for arguments; that's why we need "min-legal-vector-width" for the ABI. So it is just a matter of optimization and should be solved only with optimization options, i.e., using prefer-vector-width for the given functions.

Do we have no other workarounds? No. Since the root cause is in "MemCpyOptPass", we can try to disable it for the given scenarios. I found a simple way to do so when reading the code, https://godbolt.org/z/G5c1xPe3E; we could also add a knob for it.

danilaml commented 1 year ago

Because all of these intrinsics have at least one 512-bit argument.

But they are "alwaysinline". They probably disappear even before reaching IR.

Anyway, we can't call it a backend optimization. Or, just like me, maybe you treat all passes after the frontend as backend :)

Ok. It's more like middle end anyway.

IR has the same problem as the C code, i.e., the "MemCpyOptPass" conditionally turns store instructions into memset. That's the root cause of the inconsistency you observed.

Yes and no. The root cause is the memset expansion done by ISel. LLVM inconsistently "splits" the memset into ymm stores, while it (correctly) doesn't split the explicit IR stores, since the function is annotated with min-legal-vector-width.

What target are you testing on? My concern is it might win on the latest hardware but lose on older ones.

So far, I've run the tests on cascadelake and icelake server CPUs. I'll check on skylake-x later (if it's a win there, then I presume it's not worse on any of the others).

It's just a difference in merging strategy between "min-legal-vector-width" and "prefer-vector-width".

There is no merge strategy for prefer-vector-width. Are you suggesting that it should be added (same as min-legal-vector-width)?

"min-legal-vector-width" is not designed for optimization and should not used for it.

I got the impression from @topperc that it is supposed to be used for intrinsics as well. In fact, the original Phabricator review (https://reviews.llvm.org/D42724) has this in its description:

The idea is that this would be set based on ABI requirements, intrinsics or explicit vector types being used, maybe simd pragmas, etc

Isn't a library function always in a different TU from its caller, and vice versa? I don't understand it in this context. Shouldn't it happen to work even if it's "a global option"?

Not if they are supposed to be always_inline (or considered for inlining).

This can be solved by adding a prefer_vector_width(N) attribute in the frontend.

Where? You don't know beforehand where the function that needs to preserve its wide stores will be inlined. One could reimplement/copy the min-legal-vector-width inlining strategy for the prefer* attr, but I fear there may be some unintended consequences. It's also not exactly what I want - I don't really want to override the preferred vector width. I simply want the explicit IR/intrinsics I used not to be split further, without dropping down to inline asm (which would be incredibly clunky for this simple case).

I don't see any risk in mixing functions with different prefer-vector-width attributes

Because they don't currently mix when inlining is considered. Also, prefer-vector-width might have unwanted implications for other code - e.g. on skylake-x it may cause AVX-512 downclocking issues due to AVX-512 code generation (which doesn't affect, or only slightly affects, load/store instructions).

Do we have no other workarounds? No. Since the root cause is in "MemCpyOptPass", we can try to disable it for the given scenarios. I found a simple way to do so when reading the code, https://godbolt.org/z/G5c1xPe3E; we could also add a knob for it.

I don't see how MemCpyOptPass is the issue. It generates a memset, sure; that in and of itself is fine. The issue happens later, when the memset is expanded without respecting the IR it was created from (and into slower code as well).

I found a simple way to do so when reading the code, https://godbolt.org/z/G5c1xPe3E; we could also add a knob for it.

This generates nontemporal moves, which have all kinds of implications besides just being different asm.

phoebewang commented 1 year ago

But they are "alwaysinline". They probably disappear even before reaching IR.

That's probably the reason why we need to merge min-legal-vector-width into the caller.

Yes and no. The root cause is the memset expansion done by ISel. LLVM inconsistently "splits" the memset into ymm stores, while it (correctly) doesn't split the explicit IR stores, since the function is annotated with min-legal-vector-width.

I don't see it as a problem. The "memset expansion done by ISel" is an optimization, and it's common for an optimization to recognize only some of the IR instructions.

So far, I've run the tests on cascadelake and icelake server CPUs. I'll check on skylake-x later (if it's a win there, then I presume it's not worse on any of the others).

Per the discussions when the code was added, Skylake and Icelake would have better performance with 256-bit loads/stores.

I got the impression from @topperc that it is supposed to be used for intrinsics as well.

Right. Besides the ABI, many intrinsics' implementations are mapped directly to single instructions. That said, min-legal-vector-width is required for them; otherwise the compiler crashes during ISelLowering. OTOH, intrinsics that are not directly mapped can be lowered without min-legal-vector-width. In a word, it is still about functional correctness; it's not designed for arbitrary use.

Not if they are supposed to be always_inline (or considered for inlining).

Right, I was subconsciously assuming that such functions are only built into the library binary.

Where? You don't know beforehand where the function that needs to preserve its wide stores will be inlined. One could reimplement/copy the min-legal-vector-width inlining strategy for the prefer* attr, but I fear there may be some unintended consequences.

Because they don't currently mix when inlining is considered. Also, prefer-vector-width might have unwanted implications for other code - e.g. on skylake-x it may cause AVX-512 downclocking issues due to AVX-512 code generation (which doesn't affect, or only slightly affects, load/store instructions).

That's true, but min-legal-vector-width has the same problem, see https://godbolt.org/z/P9eje6z68

It's also not exactly what I want - I don't really want to override the preferred vector width. I simply want the explicit IR/intrinsics I used not to be split further, without dropping down to inline asm (which would be incredibly clunky for this simple case).

I don't see how MemCpyOptPass is the issue. It generates a memset, sure; that in and of itself is fine. The issue happens later, when the memset is expanded without respecting the IR it was created from (and into slower code as well).

This is just a difference in how we look at it. It doesn't look like an issue to me that the preferred vector width affects intrinsics lowering, as long as there's no correctness issue. Whether it is a valid optimization or not is another topic.

This generates nontemporal moves, which have all kinds of implications besides just being different asm.

Not sure. The asm seems generated as expected.

danilaml commented 1 year ago

That's probably the reason why we need to merge min-legal-vector-width into the caller.

Why? The backend would just legalize it using the preferred vector width, like with memset. Won't break anything.

Per the discussions when the code was added, Skylake and Icelake would have better performance with 256-bit loads/stores.

I don't see any discussion or link to benchmarks in that bug. At most, you could infer from one comment that Skylake would be worse with unaligned 512-bit ops vs. two unaligned 256-bit ops. Maybe. As I said, in my benchmark I have a pre-loop that aligns the data, so the code that gets folded is actually align 64. There is a TODO in this code. Do you think changing it to ("prefer-vector-width" OR align >= 64) would be enough?

That said, min-legal-vector-width is required for them; otherwise the compiler crashes during ISelLowering.

For some specific ones (i.e. the ones represented by intrinsics in IR), maybe. General IR ones (i.e. those that map to regular instructions on vectors) are just legalized by splitting.
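E.g. something like this (a sketch; the function name is made up) doesn't need min-legal-vector-width at all - on a target that prefers 256-bit vectors the type legalizer just splits the store:

; with no "min-legal-vector-width" (or a value below 512), the <8 x i64>
; store is simply legalized into two <4 x i64> (YMM) stores - no crash
define void @split_me(ptr %p) "min-legal-vector-width"="0" {
  store <8 x i64> <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>, ptr %p, align 64
  ret void
}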

That's true, but min-legal-vector-width has the same problem, see https://godbolt.org/z/P9eje6z68

I wouldn't call it the same. It's similar, but it's smaller in scope and easier to manage, since you need to explicitly use wide vectors in the code most of the time to see this effect (vectorizers can sometimes produce them too, but usually the rest of the code adheres to prefer-vector-width).

It doesn't look like an issue to me that the preferred vector width affects intrinsics lowering, as long as there's no correctness issue. Whether it is a valid optimization or not is another topic.

I see it as a problem because there is no other (sane) way to generate the desired code, and the part that is supposed to do that (intrinsics) does so inconsistently (it can randomly work or not, simply depending on whether the memset folding triggers). And this leads to noticeable performance regressions (in the 15-20% range).

Not sure. The asm seems generated as expected.

asm from the godbolt link:

foo:                                    # @foo
.Lfoo$local:
        vpternlogd      $255, %zmm0, %zmm0, %zmm0
        vmovntdq        %zmm0, (%rdi)
        vmovntdq        %zmm0, 64(%rdi)
        vzeroupper
        retq

https://www.felixcloutier.com/x86/movntdq

phoebewang commented 1 year ago

Why? The backend would just legalize it using the preferred vector width, like with memset. Won't break anything.

Because some intrinsics require min-legal-vector-width to be set to the correct value. prefer-vector-width will make 512-bit vectors illegal, and lowering will fail if min-legal-vector-width is not set correctly. https://godbolt.org/z/xvxqY3szb

I see it as a problem because there is no other (sane) way to generate the desired code, and the part that is supposed to do that (intrinsics) does so inconsistently (it can randomly work or not, simply depending on whether the memset folding triggers). And this leads to noticeable performance regressions (in the 15-20% range).

Doesn't make sense to me. The problem is that the current "optimization" has regressions on some targets, as you described. You can tune it for specific targets like many other tunings we have done. The most similar one is https://reviews.llvm.org/D134982. It's not fair to call it a problem with an independent attribute.

danilaml commented 1 year ago

Because some intrinsics require min-legal-vector-width to be set to the correct value. prefer-vector-width will make 512-bit vectors illegal, and lowering will fail if min-legal-vector-width is not set correctly. https://godbolt.org/z/xvxqY3szb

I've already addressed this in

For some specific ones (i.e. the ones represented by intrinsics in IR), maybe. General IR ones (i.e. those that map to regular instructions on vectors) are just legalized by splitting

Doesn't make sense to me. The problem is that the current "optimization" has regressions on some targets, as you described. You can tune it for specific targets like many other tunings we have done. The most similar one is https://reviews.llvm.org/D134982. It's not fair to call it a problem with an independent attribute.

I'm a bit lost. What doesn't make sense? What problem with an independent attribute? I only said that I believe the issue is in the memset expansion done by ISel, not in the memset "canonicalization" itself. I think it should either respect the attributes that were added (among other things) to lower intrinsics correctly, or use some different check (i.e. alignment or something else). Anyway, it was just a suggestion; I don't think it makes the underlying issue invalid. It still needs solving one way or another.