Open danilaml opened 1 year ago
@llvm/issue-subscribers-backend-x86
But even ignoring that, I'm pretty sure that using zmm stores in this case is more efficient than ymm even on targets that prefer 256-bit vectors, especially if the dest is aligned to 64 bytes.
We support prefer-vector-width at function granularity. That means we cannot selectively generate ZMM instructions for some instructions where it would be efficient but YMM for the rest.
@phoebewang This attribute doesn't work with inlining (also, clang's min_vector_width appears to be ignored completely in my tests, see https://github.com/llvm/llvm-project/issues/60946, but that's a separate issue; I'm more interested in IR). As far as I'm aware, only "min-legal-vector-width" is supposed to be properly supported by inlining, but this property is ignored by the linked check in call lowering.
See: https://godbolt.org/z/E4Wx8chqG
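For reference, the inconsistency can be reproduced with IR along these lines (a minimal sketch; the function name and exact vector types are illustrative, not copied from the link):

```llvm
; Two adjacent 64-byte zero stores in a function annotated with
; "min-legal-vector-width"="512". MemCpyOpt folds the pair into a single
; llvm.memset of 128 bytes, and SelectionDAG then expands that memset
; using the *preferred* vector width (YMM stores), ignoring the width
; the original stores were written with.
define void @baz(ptr align 64 %p) "min-legal-vector-width"="512" {
  store <8 x i64> zeroinitializer, ptr %p, align 64
  %q = getelementptr inbounds i8, ptr %p, i64 64
  store <8 x i64> zeroinitializer, ptr %q, align 64
  ret void
}
; After MemCpyOpt, roughly:
;   call void @llvm.memset.p0.i64(ptr align 64 %p, i8 0, i64 128, i1 false)
```

Remove the second store and MemCpyOpt has nothing to fold, so the single <8 x i64> store survives to ISel and becomes a zmm move.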
@topperc could perhaps explain this better and help me understand how it works, since they added this attribute in https://reviews.llvm.org/D42724.
We have some intrinsics like llvm.memset that have special handling in SelectionDAGBuilder. For llvm.memset, it allows target code to determine the best vector type.
It is very similar to #60946: optimizations (this time in the backend) honor prefer_vector_width and ignore min_vector_width.
The behavior meets the expectation of the design. As I explained in #60946, min_vector_width is not designed to override prefer_vector_width. It only works during ISelLowering when 1) we have 512-bit vector types in IR/nodes and 2) min_vector_width > prefer_vector_width. In this case, we build 256-bit nodes at the beginning of ISel, so we don't have a chance to lower them to ZMM instructions.
@phoebewang that seems wrong. What is the purpose of min-legal-vector-width when using AVX-512 intrinsics? Why does the frontend annotate the function with it if it's just ignored?
I don't think it should override prefer_vector_width; I think it should override the check in ISelLowering.
Because at the moment the codegen is really inconsistent (use one intrinsic, get a zmm move; use two, get 4 ymm moves).
Is there any case where expanding memset (or other mem intrinsics) using zmm would be slower than ymm? Especially when specifically asked for, via the check in Subtarget.useAVX512Regs()?
At the moment llvm just generates noticeably slower code (up to 20% in my benchmarks) and there is no way to nudge it into using better instructions, short of using inline asm (or disabling MemCpyOpt, or applying some code hacks).
What is the purpose of min-legal-vector-width when using AVX-512 intrinsics? Why does the frontend annotate the function with it if it's just ignored?
The short answer: for the ABI. 512-bit vector arguments must be passed/returned through ZMM registers. Other cases are out of the scope of the design.
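The ABI case can be illustrated with a sketch like this (hypothetical function, not taken from the thread):

```llvm
; A 512-bit vector argument/return value must travel in a ZMM register,
; so the function carries "min-legal-vector-width"="512" to keep
; <16 x i32> legal in ISel even when the preferred vector width is 256.
define <16 x i32> @pass_through_zmm(<16 x i32> %v) "min-legal-vector-width"="512" "prefer-vector-width"="256" {
  ret <16 x i32> %v
}
```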
I don't think it should override prefer_vector_width; I think it should override the check in ISelLowering. Because at the moment the codegen is really inconsistent (use one intrinsic, get a zmm move; use two, get 4 ymm moves).
The root difference comes from the front end: https://godbolt.org/z/bjnoz7aG1
As I explained above, we have more chances to optimize llvm.memset.
Is there any case where expanding memset (or other mem intrinsics) using zmm would be slower than ymm? Especially when specifically asked for, via the check in Subtarget.useAVX512Regs()?
I don't have an answer for that without measuring on physical machines. You are welcome to improve it if you have strong supporting data. Note, we need to consider different targets to avoid regressions.
If using ZMM registers is always a win, we can simply remove this condition rather than relying on min_vector_width.
At the moment llvm just generates noticeably slower code (up to 20% in my benchmarks) and there is no way to nudge it into using better instructions, short of using inline asm (or disabling MemCpyOpt, or applying some code hacks).
The simple way is to set "prefer-vector-width"="512" for the specific function. It is the typical scenario where you prefer 512 to 256 :)
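In IR terms, the suggested workaround would look roughly like this (a sketch; attribute spellings as used elsewhere in the thread):

```llvm
; With "prefer-vector-width"="512" on the function, 512-bit operations
; become preferred, so the memset expansion in ISel is free to pick ZMM
; stores instead of splitting the 128-byte memset into YMM stores.
define void @baz(ptr align 64 %p) "prefer-vector-width"="512" {
  call void @llvm.memset.p0.i64(ptr align 64 %p, i8 0, i64 128, i1 false)
  ret void
}
declare void @llvm.memset.p0.i64(ptr, i8, i64, i1)
```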
@phoebewang if it's "just" for the ABI, then why are AVX-512 intrinsics annotated with this attribute?
The root difference comes from the front end: https://godbolt.org/z/bjnoz7aG1 As I explained above, we have more chances to optimize llvm.memset.
Not sure where you are getting "frontend" from. Folding consecutive stores into memset is a backend optimization. Have you looked at my second godbolt link? There is no frontend there, just pure IR: https://godbolt.org/z/E4Wx8chqG (it runs clang to show asm, but you can just switch to opt and see that it folds the stores into a memset that will be expanded via ymm).
If using ZMM registers is always a win, we can simply remove this condition rather than relying on min_vector_width.
At least it's always a win for sizes from 64 bytes up to some threshold (tested up to about 16 KB), and with sufficiently aligned data (I didn't check the unaligned case, since the function I benchmarked has a preloop to align the pointers).
The simple way is to set "prefer-vector-width"="512" for the specific function. It is the typical scenario where you prefer 512 to 256 :)
This doesn't work, as demonstrated by the godbolt link above. Only min-legal-vector-width survives inlining. "prefer-vector-width" really only works as a global option, or maybe when you are writing application code. You can't use it for a library function (or at least you are not in control of the code that is going to call it). Not to mention that there doesn't seem to be a way to do that in clang, so someone trying to use AVX-512 intrinsics still can't make the compiler generate the intended asm without relying on hacks, like making a separate TU with different options (and good luck if you need to mix functions with different prefer-vector-width attributes). Although I'm more concerned about the IR.
@phoebewang if it's "just" for the ABI, then why are AVX-512 intrinsics annotated with this attribute?
Because all of these intrinsics have at least one 512-bit argument.
Not sure where are you getting "frontend" from. Folding consecutive stores into memset is a backend optimization.
Ok, my fault. I sometimes take passes before "-emit-llvm" as "frontend". Actually it's a middle-end pass, "MemCpyOpt". Anyway, we can't call it a backend optimization. Or maybe, like me, you used to take all passes after the frontend as "backend" :)
Have you looked at my second godbolt link? There is no frontend there, just pure IR: https://godbolt.org/z/E4Wx8chqG (it runs clang to show asm, but you can just switch to opt and see that it folds the stores into a memset that will be expanded via ymm)
IR has the same problem as C code, i.e., the "MemCpyOptPass" conditionally turns store instructions into memset. That's the root cause of the inconsistency you observed.
At least it's always a win for sized from 64 to some threshold (tested to about 16 KB).
What target are you testing on? My concern is that it might win on the latest hardware but lose on older ones.
This doesn't work, as demonstrated by the godbolt link above. Only min-legal-vector-width survives inlining.
It's just a difference in merging strategy between "min-legal-vector-width" and "prefer-vector-width". "min-legal-vector-width" is not designed for optimization and should not be used for it.
"prefer-vector-width" really only works as a global option, or when you are writing application code maybe.
I don't get the point. "prefer-vector-width" is a function attribute; why do you think it's a global option? You are free to manipulate it in the IR. Or do you mean the frontend lacks attribute support like min_vector_width(512) for C/C++ code?
You can't use that for library function (or at least you are not in control of the code that is going to call it).
Isn't the library function always in a different TU from the caller, and vice versa? I don't understand from the context. Shouldn't it happen to work even if it's "a global option"?
Not to mention that there doesn't seem to be a way to do that in clang
This can be solved by adding a prefer_vector_width(N) attribute in the frontend.
so someone trying to use avx512 intrinsics still can't make compiler generate intended asm without relying on hacks, like making a separate TU with different options (and good luck if you need to mix function with different prefer-vector-width attributes)
I don't see any risk in mixing functions with different prefer-vector-width attributes. Generating YMM instructions or ZMM ones in a function body won't affect correctness. It's different for arguments; that's why we need "min-legal-vector-width" for the ABI. So it is just an optimization matter and should only be solved with an optimization option, i.e., using prefer-vector-width for the given functions.
Do we have no other workarounds? No. Since the root cause is in "MemCpyOptPass", we can try to disable it for given scenarios. I found a simple way to do so when reading the code: https://godbolt.org/z/G5c1xPe3E. We could also add a knob to do it.
Because all of these intrinsics have at least one 512-bit argument.
But they are "alwaysinline". They probably disappear even before reaching IR.
Anyway, we can't call it a backend optimization. Or maybe, like me, you used to take all passes after the frontend as "backend" :)
Ok. It's more like middle end anyway.
IR has the same problem as C code, i.e., the "MemCpyOptPass" conditionally turns store instructions into memset. That's the root cause of the inconsistency you observed.
Yes and no. The root cause is memset expansion done by isel. LLVM inconsistently "splits" memset into ymm stores, but doesn't split IR stores (as it should, since it's annotated by min-legal-vector-width).
What target are you testing on? My concern is that it might win on the latest hardware but lose on older ones.
So far, I've run the tests on cascadelake and icelake server CPUs. I'll check on skylake-x later (if it's a win there, then I presume it's not worse on any of the others).
It's just a difference in merging strategy between "min-legal-vector-width" and "prefer-vector-width".
There is no merge strategy for prefer-vector-width. Are you suggesting that it should be added (same as min-legal-vector-width)?
"min-legal-vector-width" is not designed for optimization and should not used for it.
I got the impression from @topperc that it is supposed to be used for intrinsics as well. In fact, the original Phabricator review (https://reviews.llvm.org/D42724) has this in its description:
The idea is that this would be set based on ABI requirements, intrinsics or explicit vector types being used, maybe simd pragmas, etc
Isn't the library function always in a different TU from the caller, and vice versa? I don't understand from the context. Shouldn't it happen to work even if it's "a global option"?
Not if they are supposed to be always_inline (or considered for inlining).
This can be solved by adding a prefer_vector_width(N) attribute in the frontend.
Where? You don't know beforehand where the function that needs to preserve its wide stores will be inlined. One could reimplement/copy the min-legal-vector-width inlining strategy for the prefer* attribute, but I fear there may be some unintended consequences. It's also not exactly what I want: I don't really want to override the preferred vector width. I simply want the explicit IR/intrinsics I used not to be split further, without dropping down to inline asm (which would be incredibly clunky for this simple case).
I don't see any risk to mix function with different prefer-vector-width attributes
Because they don't mix currently when inlining is considered. Also, prefer-vector-width might have unwanted implications for other code: e.g. on skylake-x it may cause AVX-512 downclocking issues due to AVX-512 code generation (which doesn't affect, or only slightly affects, load/store instructions).
Do we have no other workarounds? No. Since the root cause is in "MemCpyOptPass", we can try to disable it for given scenarios. I found a simple way to do so when reading the code: https://godbolt.org/z/G5c1xPe3E. We could also add a knob to do it.
I don't see how MemCpyOptPass is the issue. It generates memset, sure; that in and of itself is fine. The issue happens later, when the memset is expanded without using the IR it was created from (and into slower code as well).
I found a simple way to do so when reading the code: https://godbolt.org/z/G5c1xPe3E. We could also add a knob to do it.
This generates non-temporal moves, which have all kinds of implications besides just being different asm.
But they are "alwaysinline". They probably disappear even before reaching IR.
That's probably the reason why we need to merge min-legal-vector-width into the caller.
Yes and no. The root cause is memset expansion done by isel. LLVM inconsistently "splits" memset into ymm stores, but doesn't split IR stores (as it should, since it's annotated by min-legal-vector-width).
I don't see it as a problem. "memset expansion done by isel" is an optimization, and it's common that an optimization only recognizes some of the IR instructions.
So far, I've run the tests on cascadelake and icelake server CPUs. I'll check on skylake-x later (if it's a win there, then I presume it's not worse on any of the others).
Per discussions when the code was added, Skylake and Icelake would have better performance with 256-bit loads/stores.
I got impression from @topperc that it is supposed to be used for intrinsics as well.
Right, besides the ABI, many intrinsics' implementations map directly to single instructions. That said, min-legal-vector-width is required for them; otherwise, the compiler crashes during ISelLowering. OTOH, intrinsics that don't map directly can be lowered without min-legal-vector-width. In a word, it's still there for functional correctness. It's not designed for arbitrary use.
Not if they are supposed to be always_inline (or considered for inlining).
Right, I subconsciously assumed functions are only built into the library binary.
Where? You don't know beforehand where the function that needs to preserve its wide stores will be inlined. One could reimplement/copy the min-legal-vector-width inlining strategy for the prefer* attribute, but I fear there may be some unintended consequences.
Because they don't mix currently when inlining is considered. Also, prefer-vector-width might have unwanted implications for other code: e.g. on skylake-x it may cause AVX-512 downclocking issues due to AVX-512 code generation (which doesn't affect, or only slightly affects, load/store instructions).
That's true, but min-legal-vector-width has the same problem, see https://godbolt.org/z/P9eje6z68
It's also not exactly what I want: I don't really want to override the preferred vector width. I simply want the explicit IR/intrinsics I used not to be split further, without dropping down to inline asm (which would be incredibly clunky for this simple case).
I don't see how MemCpyOptPass is the issue. It generates memset, sure; that in and of itself is fine. The issue happens later, when the memset is expanded without using the IR it was created from (and into slower code as well).
This is just a difference in the perspective from which we see it. It doesn't look to me like an issue that the preferred vector width affects intrinsics lowering, as long as there's no correctness issue. Whether it is a valid optimization or not is another topic.
This generates non-temporal moves, which have all kinds of implications besides just being different asm.
Not sure. The asm seems generated as expected.
That's probably the reason why we need to merge min-legal-vector-width into the caller.
Why? The backend would just legalize it using the preferred vector width, like with memset. It won't break anything.
Per to discussions when adding the code, Skylake and Icelake would have better performance with 256-bit loads/stores.
I don't see any discussion or link to benchmarks in the bug. At most, you could infer from one comment that Skylake would be worse with unaligned 512-bit ops vs. two unaligned 256-bit ops. Maybe; as I said, my benchmark has a preloop that aligns the data, so the code that gets folded is actually align 64. There is a TODO in this code. Do you think changing it to ("prefer-vector-width" OR align >= 64) would be enough?
That said, min-legal-vector-width is required for them; otherwise, the compiler crashes during ISelLowering.
For some specific ones (i.e. the ones represented by intrinsics in IR), maybe. Generic IR ones (i.e. those that map to regular instructions on vectors) are just legalized by splitting.
That's true, but min-legal-vector-width has the same problem, see https://godbolt.org/z/P9eje6z68
I wouldn't call it the same. It's similar, but smaller in scope and easier to manage, since most of the time you need to explicitly use wide vectors in the code to see this effect (vectorizers can sometimes produce them too, but usually the rest of the code adheres to prefer-vector-width).
It doesn't look to me an issue that preferred vector affects intrinsics lowering, as long as there's no correctness issue. Whether it is a valid optimization or not is another topic.
I see it as a problem because there is no other (sane) way to generate the desired code, and the part that is supposed to do that (intrinsics) does so inconsistently (it can randomly work or not depending simply on whether memset folding triggers). And this leads to noticeable performance regressions (in the 15-20% range).
Not sure. The asm seems generated as expected.
asm from the godbolt link:
```asm
foo:                                    # @foo
.Lfoo$local:
        vpternlogd      $255, %zmm0, %zmm0, %zmm0
        vmovntdq        %zmm0, (%rdi)
        vmovntdq        %zmm0, 64(%rdi)
        vzeroupper
        retq
```
Why? The backend would just legalize it using prefer vector width, like with memset. Won't break anything.
Because some intrinsics require min-legal-vector-width to be set to the correct value. prefer-vector-width will make 512-bit vectors illegal and result in a lowering failure without min-legal-vector-width correctly set. https://godbolt.org/z/xvxqY3szb
I see it as a problem because there is no other (sane) way to generate the desired code, and the part that is supposed to do that (intrinsics) does so inconsistently (it can randomly work or not depending simply on whether memset folding triggers). And this leads to noticeable performance regressions (in the 15-20% range).
Doesn't make sense to me. The problem is that the current "optimization" has regressions on some targets, as you described. You can tune it for specific targets like many other tunings we have done. The most similar one is https://reviews.llvm.org/D134982. It's not fair to pin the problem on an independent attribute.
Because some intrinsics require min-legal-vector-width to be set to the correct value. prefer-vector-width will make 512-bit vectors illegal and result in a lowering failure without min-legal-vector-width correctly set. https://godbolt.org/z/xvxqY3szb
I've already addressed this in
For some specific ones (i.e. the ones represented by intrinsics in IR), maybe. Generic IR ones (i.e. those that map to regular instructions on vectors) are just legalized by splitting.
Doesn't make sense to me. The problem is that the current "optimization" has regressions on some targets, as you described. You can tune it for specific targets like many other tunings we have done. The most similar one is https://reviews.llvm.org/D134982. It's not fair to pin the problem on an independent attribute.
I'm a bit lost. What doesn't make sense? What "problem to an independent attribute"? I only said that I believe the issue is in the memset expansion done by isel, not in the memset "canonicalization" itself. I think it should either respect the attributes that were added to correctly lower intrinsics (among other things) or use some different check (e.g. alignment or something else). Anyway, it was just a suggestion. I don't think it makes the underlying issue invalid; it still needs solving one way or another.
For the following C/C++ code:
Clang with -O3 -mcpu=icelake-server -force-vector-width=512 -print-after-all -debug would generate the following assembly. Godbolt link: https://godbolt.org/z/7fTPjfff8
But if you comment out the second store in baz, the zmm store is generated. Looks like when there are several consecutive stores, they get folded into a memset intrinsic in the IR, which is then expanded during selection DAG creation using the preferred vector width, and not the actually requested one. But even ignoring that, I'm pretty sure that using zmm stores in this case is more efficient than ymm even on targets that prefer 256-bit vectors, especially if the dest is aligned to 64 bytes. Somewhat related to https://github.com/llvm/llvm-project/issues/42585
Perhaps changing this line to Subtarget.useAVX512Regs()? https://github.com/llvm/llvm-project/blob/ca611affd3e5dfe00e6ebe0488994bf93c2d135c/llvm/lib/Target/X86/X86ISelLoweringCall.cpp#L285