Open sjoerdmeijer opened 2 months ago
CC: @fhahn, in case you have some thoughts on this.
I don't think this is an issue directly with LV. Interleaving is disabled via a pragma, so after LV we have (https://godbolt.org/z/3dhczjjzz):
vector.ph: ; preds = %for.cond1.preheader
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x float> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
%0 = add i64 %index, 0
%1 = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %0
%2 = getelementptr inbounds float, ptr %1, i32 0
%wide.load = load <4 x float>, ptr %2, align 4, !tbaa !7
%3 = fadd fast <4 x float> %wide.load, %vec.phi
%index.next = add nuw i64 %index, 4
%4 = icmp eq i64 %index.next, 32000
br i1 %4, label %middle.block, label %vector.body, !llvm.loop !11
middle.block: ; preds = %vector.body
%5 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> %3)
br i1 true, label %for.cond.cleanup3, label %scalar.ph
The vector loop later gets unrolled by the unroller, which is a bit surprising as the scalar version didn't get unrolled earlier. LV adds metadata to loops to disable runtime unrolling, but at the moment it does not do so for regular unrolling.
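For readers without the godbolt link open, here is a rough sketch of the kind of source loop that could produce IR like the above. It is a reconstruction from the IR, not the actual test case: the function name `sum_a` is made up, the outer loop hinted at by `%for.cond1.preheader` is left out, and the `fadd fast` in the IR implies fast-math flags that this snippet does not set by itself. The pragma is Clang-specific; other compilers will ignore it.

```cpp
#include <cstddef>

// Assumed shape of the global from the IR: [32000 x float] @a.
constexpr std::size_t N = 32000;
float a[N];

// Hypothetical name; a plain float sum reduction over the global array.
float sum_a() {
    float sum = 0.0f;
    // Clang loop hint: leave vectorization alone, but ask LV not to interleave.
#pragma clang loop interleave(disable)
    for (std::size_t i = 0; i < N; ++i)
        sum += a[i];
    return sum;
}
```

Note that this hint only constrains the vectorizer's interleaving; it says nothing to the later unroll pass, which matches the behaviour described here.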
Oh, that's indeed surprising! I hadn't looked into details, was just assuming it was the LV, so thanks for taking a look.
I have never looked at regular unrolling of vector bodies, but do you know if it makes sense to teach it not to do that?
Confirmed that the unroller is causing this.
A bit more interleaving avoids this, because the additional unrolling then does not kick in. So this is going to be fixed by https://github.com/llvm/llvm-project/pull/100385 for the Neoverse-V2, but it might still be an issue for other targets.
It looks like our vectorisation strategy is to have some in-loop reduction/dependencies for a simple reduction like this:
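To illustrate why unrolling and interleaving behave differently for a reduction, here is a minimal scalar sketch. It is my own illustration, not compiler output: the function names are made up, and each scalar accumulator stands in for one vector register. Unrolling a loop that has a single accumulator keeps every add on one serial chain, whereas interleaving gives each copy its own accumulator and combines them after the loop.

```cpp
#include <cstddef>

// Unrolled 4x, but still one accumulator: each add waits for the previous one.
float sum_single_chain(const float* p, std::size_t n) {
    float acc = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += p[i];
        acc += p[i + 1];
        acc += p[i + 2];
        acc += p[i + 3];
    }
    for (; i < n; ++i)  // scalar tail for the leftover elements
        acc += p[i];
    return acc;
}

// Interleaved style: four independent chains, combined once after the loop.
float sum_four_chains(const float* p, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += p[i];
        acc1 += p[i + 1];
        acc2 += p[i + 2];
        acc3 += p[i + 3];
    }
    float acc = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; ++i)  // scalar tail for the leftover elements
        acc += p[i];
    return acc;
}
```

The second form reassociates the floating-point adds, so in general it can round differently from the first; a compiler may only make that change itself when reassociation is allowed, as with the `fast` flag on the `fadd` in the IR above.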
Because we generate something like this:
But GCC is generating something more like this:
We have more dependency chains in the loop body, which can slow us down.
Here's an AArch64 code example on compiler explorer: https://godbolt.org/z/v1c6hxfGc
I have disabled the interleaver to have a more concise example, but with interleaving things are very similar.