Open zephyr111 opened 2 years ago
@llvm/issue-subscribers-backend-x86
The inner loop of 4 iterations gets fully unrolled before the vectorizer runs. This creates 4 separate scalar loads, each moving 4 elements ahead on the next iteration of the outer loop, along with 4 scalar FMAs.
With LLVM trunk, the loop vectorizer vectorizes each of those FMAs using the strided accesses: one FMA works on elements 0, 4, 8, 12, another on elements 1, 5, 9, 13, and so on. All the shuffles are just trying to rearrange the loaded data into that order.
With the inner scalar loop removed, the vectorizer produces a loop with 4 vector loads and 4 vector FMAs.
Hello,
The following simple code produces pretty inefficient assembly with the flags `-O3 -mavx2 -mfma -ffast-math`, whatever the version of Clang used. This can be seen on Godbolt. The automatic vectorization produces FMA operations working on XMM registers instead of YMM registers, and it also uses `vunpckhpd` instructions for no apparent reason. This is the case for all versions from Clang 5.0 to Clang 13.0. Note that the use of the `__restrict` keyword does not visibly change the outcome. The recent trunk version of Clang on Godbolt (commit 2f18b02d) succeeds in using YMM registers, but it makes use of many expensive `vperm2f128` and `vunpcklpd` instructions.
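(The snippet below is only a minimal sketch of the kind of kernel being discussed, assuming an inner loop of 4 iterations that accumulates interleaved data; the function and array names are illustrative, not taken from the original code.)

```c
/* Hypothetical reconstruction of the reported kernel; names and the exact
 * computation are assumptions. Build with -O3 -mavx2 -mfma -ffast-math. */
void kernel(const float *restrict a, const float *restrict b,
            float *restrict s, long n)
{
    for (long i = 0; i < n; ++i) {
        /* This loop of 4 iterations is fully unrolled before the loop
         * vectorizer runs, leaving 4 scalar loads (each advancing by 4
         * elements per outer iteration) and 4 scalar FMAs: the FMA for
         * j = 0 touches elements 0, 4, 8, 12, ..., the one for j = 1
         * touches elements 1, 5, 9, 13, ..., and so on. */
        for (int j = 0; j < 4; ++j)
            s[j] += a[4 * i + j] * b[j];
    }
}
```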
It is possible to perform a much better vectorization using SIMD intrinsics. Here is an example (note that the loop should be unrolled about 4 times so as to mitigate the latency of the FMA instructions):
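A minimal sketch along those lines, written against the hypothetical kernel above rather than the original code: each 256-bit load covers two outer iterations (8 contiguous floats), the 4 coefficients are replicated across both 128-bit halves, and four independent accumulators hide the FMA latency. Remainder handling is omitted and `n` is assumed to be a multiple of 8.

```c
#include <immintrin.h>

/* Illustrative AVX2/FMA version of the hypothetical kernel above.
 * All names are assumptions; assumes n is a multiple of 8. */
void kernel_avx2(const float *restrict a, const float *restrict b,
                 float *restrict s, long n)
{
    const __m128 b4 = _mm_loadu_ps(b);
    const __m256 bb = _mm256_set_m128(b4, b4);   /* b[0..3] in both halves */
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    for (long i = 0; i < n; i += 8) {            /* 8 outer iterations per step */
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[4 * i +  0]), bb, acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[4 * i +  8]), bb, acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[4 * i + 16]), bb, acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[4 * i + 24]), bb, acc3);
    }

    /* Each lane j (mod 4) holds partial sums for coefficient j, so folding
     * the accumulators and then the two 128-bit halves gives the 4 results. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    _mm_storeu_ps(s, _mm_add_ps(_mm_loadu_ps(s), _mm_add_ps(lo, hi)));
}
```

The 4 independent accumulators follow the suggestion above about hiding FMA latency, and plain packed loads replace the shuffles and `vperm2f128` that the auto-vectorized code uses.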
When a register blocking strategy is manually applied, the generated code is even worse: it makes use of slow gather instructions instead of packed loads. For more information about this more complex example, please read this Stack Overflow post.
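The code in the linked post is more involved; purely to illustrate what register blocking means here, a blocked variant of the hypothetical kernel above could look like the sketch below, where the partial sums for a small block of outer iterations are kept in named scalars so they stay in registers. The gather behaviour reported above refers to the code in the Stack Overflow post, not necessarily to this simplified sketch.

```c
/* Illustrative register-blocked variant of the hypothetical kernel above
 * (block size 2; assumes n is a multiple of 2). All names are assumptions. */
void kernel_blocked(const float *restrict a, const float *restrict b,
                    float *restrict s, long n)
{
    float s00 = 0.f, s01 = 0.f, s02 = 0.f, s03 = 0.f;
    float s10 = 0.f, s11 = 0.f, s12 = 0.f, s13 = 0.f;

    for (long i = 0; i < n; i += 2) {   /* two outer iterations per step */
        s00 += a[4 * i + 0] * b[0];
        s01 += a[4 * i + 1] * b[1];
        s02 += a[4 * i + 2] * b[2];
        s03 += a[4 * i + 3] * b[3];
        s10 += a[4 * i + 4] * b[0];
        s11 += a[4 * i + 5] * b[1];
        s12 += a[4 * i + 6] * b[2];
        s13 += a[4 * i + 7] * b[3];
    }
    s[0] += s00 + s10;
    s[1] += s01 + s11;
    s[2] += s02 + s12;
    s[3] += s03 + s13;
}
```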
Note that similar issues also appear with ICC and GCC.