llvm / llvm-project


[Unroll] Further unrolling vector reductions creates dependency chains #108028

Open · sjoerdmeijer opened this issue 2 months ago

sjoerdmeijer commented 2 months ago

It looks like our vectorisation strategy keeps some in-loop reduction dependencies for a simple reduction like this:

for (int i = 0; i < N; i++) {
    sum += a[i];
}

Because we generate something like this:

vector.body:
   vecsum1 += a[..]
   vecsum2  = a[..] + a[..]
   vecsum1 += vecsum2
   vecsum2  = a[..] + a[..]
   vecsum1 += vecsum2
end
// adding partial sums

But GCC is generating something more like this:

vector.body:
   vecsum1 += a[i:i+4]
   vecsum2 += a[i+4:i+8]
   vecsum3 += a[i+8:i+12]
   vecsum4 += a[i+12:i+16]
end
// adding partial sums

We end up with a longer dependency chain in the loop body (every partial sum feeds back into a single accumulator), which can slow us down.
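
To make the difference concrete at source level, here is a hand-written scalar sketch of the GCC-style shape (the function name sum4 and the accumulator names are illustrative, and reassociating the float additions like this is only equivalent under fast-math):

float sum4(const float *a, int n) {
    /* four independent partial sums: each addition only depends on its
       own accumulator, so the four chains can overlap in the pipeline */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* scalar remainder */
        s0 += a[i];
    /* combine the partial sums once, after the loop */
    return (s0 + s1) + (s2 + s3);
}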

Here's an AArch64 code example on compiler explorer: https://godbolt.org/z/v1c6hxfGc

I have disabled the interleaver to keep the example concise, but with interleaving things look very similar.
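
For reference, interleaving can be disabled at source level with a clang loop pragma; the exact code behind the godbolt link is not reproduced here, but it is presumably something along these lines:

// illustrative, not copied from the link
#pragma clang loop interleave_count(1)
for (int i = 0; i < N; i++)
    sum += a[i];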

sjoerdmeijer commented 2 months ago

CC: @fhahn, in case you have some thoughts on this.

fhahn commented 2 months ago

I don't think this is directly an issue with LV; interleaving is disabled via a pragma, so after LV we have (https://godbolt.org/z/3dhczjjzz):

vector.ph:                                        ; preds = %for.cond1.preheader
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x float> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
  %0 = add i64 %index, 0
  %1 = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %0
  %2 = getelementptr inbounds float, ptr %1, i32 0
  %wide.load = load <4 x float>, ptr %2, align 4, !tbaa !7
  %3 = fadd fast <4 x float> %wide.load, %vec.phi
  %index.next = add nuw i64 %index, 4
  %4 = icmp eq i64 %index.next, 32000
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !11

middle.block:                                     ; preds = %vector.body
  %5 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> %3)
  br i1 true, label %for.cond.cleanup3, label %scalar.ph

The vector loop later gets unrolled by the unroller, which is a bit surprising as the scalar version didn't get unrolled earlier. LV adds metadata to loops to disable runtime unrolling, but at the moment not regular unrolling.
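
For reference, the !llvm.loop metadata on the backedge branch above typically looks roughly like this (node numbers and exact contents are illustrative, not copied from the godbolt output):

!11 = distinct !{!11, !12, !13}
!12 = !{!"llvm.loop.isvectorized", i32 1}
!13 = !{!"llvm.loop.unroll.runtime.disable"}
; full/partial unrolling would additionally be suppressed by something like
; !{!"llvm.loop.unroll.disable"}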

sjoerdmeijer commented 2 months ago

Oh, that's indeed surprising! I hadn't looked into the details and was just assuming it was the LV, so thanks for taking a look.

I have never looked at regular unrolling of vector bodies, but do you know whether it makes sense to teach the unroller not to do that?

sjoerdmeijer commented 2 months ago

Confirmed that the unroller is causing this.

This will be avoided by a bit more interleaving: with more interleaving the additional unrolling does not kick in. So this is going to be fixed by https://github.com/llvm/llvm-project/pull/100385 for the Neoverse-V2, but it might still be an issue for other targets.