Open fhahn opened 4 years ago
FWIW, it seems that the assembly output has changed when using the trunk version, although I'm not sure if it's as efficient as the gcc output.
Also, I tried to find in which clang version this problem was first introduced, and I saw that even 3.3 has the issue.
I suppose you haven't had time to look into this yet? I look forward to an LLVM release with this fixed...
Just out of curiosity, does this affect all LLVM versions, or did it start with a particular version?
Making SROA prefer load/stores over llvm.memcpy for slices that are relatively small would address the issue: https://reviews.llvm.org/D88893
Of course, we should also improve other parts of LLVM to handle llvm.memcpy better.
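As a rough C-level analogy (my own illustration, not code from D88893), the two lowering strategies for a small 16-byte slice look like this:

```c
/* Illustration only: a 16-byte "slice" copied via memcpy versus via an explicit
 * load/store pair. D88893 proposes that SROA emit the load/store form directly
 * for sufficiently small slices instead of an llvm.memcpy call. */
typedef float v4sf __attribute__((vector_size(16), aligned(4), may_alias));

void copy_slice_memcpy(float *dst, const float *src) {
  __builtin_memcpy(dst, src, 16);      /* lowered to an llvm.memcpy intrinsic */
}

void copy_slice_loadstore(float *dst, const float *src) {
  *(v4sf *)dst = *(const v4sf *)src;   /* a plain 16-byte load and store */
}
```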
Is LLVM 12 still affected by this issue, or has there been any progress or change that modified this behaviour?
Unfortunately, yes. I have not had time to follow up on the original patch so far.
Looks like LICM also has problems with hoisting out invariant llvm.memcpy calls: #47053
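To illustrate the pattern (a hedged example of my own, not the test case from #47053): the copy below has a loop-invariant source, destination, and size, so it could in principle be hoisted out of the loop.

```c
#include <string.h>

/* Hypothetical example: the memcpy re-copies the same 64 bytes on every
 * iteration, which is the kind of invariant llvm.memcpy call the comment
 * above says LICM has trouble hoisting. */
void scale_rows(float *restrict dst, const float *restrict src, int n) {
  float row[16];
  for (int i = 0; i < n; ++i) {
    memcpy(row, src, sizeof row);   /* invariant copy, redone every iteration */
    dst[i] = row[i & 15] * 2.0f;
  }
}
```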
Alternatively, if we created vector stores instead of the small memcpy calls, we would probably get a better result overall. Using Clang's Matrix Types extension effectively does so, and with that version https://godbolt.org/z/nvq86W I get the same speed as with SROA disabled (although the code is not as nice as it could be right now, since there is no syntax for constant initializers for matrix types yet).
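For reference, a hedged sketch of what the Matrix Types variant looks like (the function and type names are mine, not the exact code from the godbolt link; compile with -fenable-matrix):

```c
typedef float m4x4_t __attribute__((matrix_type(4, 4)));

/* Hypothetical sketch: fill an array of 4x4 matrices using Clang's matrix type
 * extension instead of a struct assignment that lowers to small memcpys. */
void fill_with_matrix_type(float *dst, int matrix_count) {
  m4x4_t identity;
  /* No constant-initializer syntax for matrix types yet, so build it elementwise. */
  for (int r = 0; r < 4; ++r)
    for (int c = 0; c < 4; ++c)
      identity[r][c] = (r == c) ? 1.0f : 0.0f;

  for (int i = 0; i < matrix_count; ++i)
    /* Stores all 16 floats of the value at once; this lowers to wide vector
       stores rather than a small memcpy. */
    __builtin_matrix_column_major_store(identity, dst + i * 16, 4);
}
```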
It looks like there are 2 issues that account for the performance difference:
1) GCC interchanges the main loops, so the GCC version has a much better memory access pattern. If the loops are interchanged manually (see the sketch after this list), runtime improves by 3x on my system. LLVM's loop-interchange pass is disabled by default and also does not support the memcpy calls. Interchanging gives a nice boost here, but that is mostly due to how the benchmark is structured.
2) SROA replaces the single memcpy in the inner loop with 3 separate copies. With SROA disabled, runtime improves again by 2x.
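A hedged sketch of the manual interchange from (1), applied to the timed loop nest of the benchmark in the Extended Description below (the wrapper function is mine; Matrix is the struct from the benchmark):

```c
/* With the loops swapped, each 64-byte matrix is rewritten 25 times while it is
 * still hot in cache, instead of streaming over the whole ~640 MB array once
 * per run. */
void fill_interchanged(Matrix *matrices, int matrix_count, Matrix identity) {
  for (int i = 0; i < matrix_count; ++i)
    for (int run = 0; run < 25; ++run)
      matrices[i] = identity;
}
```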
Can we assume this is fixed in 13.0.1? The assembly output has changed, so I wonder whether it is supposed to be fixed.
Extended Description
As reported on the mailing list (http://lists.llvm.org/pipermail/llvm-dev/2020-September/145367.html) and on Discord, GCC reportedly beats LLVM by 4x on the code below:
https://godbolt.org/z/4G1rh1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MILLION 1000000

typedef struct Matrix { float E[4][4]; } Matrix;

int main(void) {
  Matrix identity = { {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}} };

  int matrix_count = 10 * MILLION;
  Matrix *matrices = (Matrix *) malloc(matrix_count * sizeof(Matrix));

  clock_t begin = clock();
  for (int run = 0; run < 25; ++run) {
    for (int i = 0; i < matrix_count; ++i) {
      matrices[i] = identity;
    }
  }
  clock_t end = clock();

  printf("Value Check: %f\n", matrices[matrix_count / 2].E[2][2]);
  printf("Time in seconds: %f\n", (double)(end - begin) / CLOCKS_PER_SEC);
}