Open fhahn opened 4 years ago
FWIW, it seems that the assembly output has changed when using the trunk version, although I'm not sure if it's as efficient as the gcc output.
Also, I tried to find in which clang version this problem was first introduced, and I saw that even 3.3 has the issue.
I suppose you haven't had time to look into this yet? I look forward to an LLVM release with this fixed...
Just out of curiosity, does this affect all LLVM versions, or did it start with a particular version?
Making SROA prefer load/stores over llvm.memcpy for slices that are relatively small would address the issue: https://reviews.llvm.org/D88893
Of course, we should also improve other parts of LLVM to handle llvm.memcpy better.
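As a rough C-level analogy (my own illustration, not code from D88893), the two lowering strategies for a small 16-byte slice look like this:

```c
/* Illustration only: a 16-byte "slice" copied via memcpy versus via an explicit
 * load/store pair. D88893 proposes that SROA emit the load/store form directly
 * for sufficiently small slices instead of an llvm.memcpy call. */
typedef float v4sf __attribute__((vector_size(16), aligned(4), may_alias));

void copy_slice_memcpy(float *dst, const float *src) {
  __builtin_memcpy(dst, src, 16);      /* lowered to an llvm.memcpy intrinsic */
}

void copy_slice_loadstore(float *dst, const float *src) {
  *(v4sf *)dst = *(const v4sf *)src;   /* a plain 16-byte load and store */
}
```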
Is LLVM 12 still affected by this issue, or has there been any progress or change that modified this behaviour?
Unfortunately, yes. I have not had time to follow up on the original patch so far.
Looks like LICM also has problems with hoisting out invariant llvm.memcpy calls: #47053
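To illustrate the pattern (a hedged example of my own, not the test case from #47053): the copy below has a loop-invariant source, destination, and size, so it could in principle be hoisted out of the loop.

```c
#include <string.h>

/* Hypothetical example: the memcpy re-copies the same 64 bytes on every
 * iteration, which is the kind of invariant llvm.memcpy call the comment
 * above says LICM has trouble hoisting. */
void scale_rows(float *restrict dst, const float *restrict src, int n) {
  float row[16];
  for (int i = 0; i < n; ++i) {
    memcpy(row, src, sizeof row);   /* invariant copy, redone every iteration */
    dst[i] = row[i & 15] * 2.0f;
  }
}
```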
Alternatively, if we created vector stores instead of the small memcpy calls, we would probably get a better result overall. Using Clang's Matrix Types extension effectively does so, and with that version https://godbolt.org/z/nvq86W I get the same speed as with SROA disabled (although the code is not as nice as it could be right now, since there is no syntax for constant initializers for matrix types yet).
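For reference, a hedged sketch of what the Matrix Types variant looks like (the function and type names are mine, not the exact code from the godbolt link; compile with -fenable-matrix):

```c
typedef float m4x4_t __attribute__((matrix_type(4, 4)));

/* Hypothetical sketch: fill an array of 4x4 matrices using Clang's matrix type
 * extension instead of a struct assignment that lowers to small memcpys. */
void fill_with_matrix_type(float *dst, int matrix_count) {
  m4x4_t identity;
  /* No constant-initializer syntax for matrix types yet, so build it elementwise. */
  for (int r = 0; r < 4; ++r)
    for (int c = 0; c < 4; ++c)
      identity[r][c] = (r == c) ? 1.0f : 0.0f;

  for (int i = 0; i < matrix_count; ++i)
    /* Stores all 16 floats of the value at once; this lowers to wide vector
       stores rather than a small memcpy. */
    __builtin_matrix_column_major_store(identity, dst + i * 16, 4);
}
```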
It looks like there are 2 issues that account for the performance difference:
1) GCC interchanges the main loops, so the GCC version has a much better memory access pattern. If the loops are interchanged manually (see the sketch after this list), runtime improves by 3x on my system. LLVM's loop-interchange pass is disabled by default and also does not support the memcpy calls. Interchanging gives a nice boost here, but that is mostly due to how the benchmark is structured.
2) SROA replaces the single memcpy in the inner loop with 3 separate copies. With SROA disabled, runtime improves again by 2x.
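A hedged sketch of the manual interchange from (1), applied to the timed loop nest of the benchmark in the Extended Description below (the wrapper function is mine; Matrix is the struct from the benchmark):

```c
/* With the loops swapped, each 64-byte matrix is rewritten 25 times while it is
 * still hot in cache, instead of streaming over the whole ~640 MB array once
 * per run. */
void fill_interchanged(Matrix *matrices, int matrix_count, Matrix identity) {
  for (int i = 0; i < matrix_count; ++i)
    for (int run = 0; run < 25; ++run)
      matrices[i] = identity;
}
```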
Can we assume this is fixed in 13.0.1? The assembly output has changed, so I wonder whether it is supposed to be fixed.
Extended Description
As reported on the mailing list (http://lists.llvm.org/pipermail/llvm-dev/2020-September/145367.html) and on Discord, GCC reportedly beats LLVM by 4x on the code below:
https://godbolt.org/z/4G1rh1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MILLION 1000000

typedef struct Matrix { float E[4][4]; } Matrix;

int main(void) {
  Matrix identity = { {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}} };

  int matrix_count = 10 * MILLION;
  Matrix *matrices = (Matrix *) malloc(matrix_count * sizeof(Matrix));

  clock_t begin = clock();
  for (int run = 0; run < 25; ++run) {
    for (int i = 0; i < matrix_count; ++i) {
      matrices[i] = identity;
    }
  }
  clock_t end = clock();

  printf("Value Check: %f\n", matrices[matrix_count / 2].E[2][2]);
  printf("Time in seconds: %f\n", (double)(end - begin) / CLOCKS_PER_SEC);
}