Open igogo-x86 opened 3 months ago
@davemgreen @nikic
After turning off pre-LTO vectorisation, the x264 benchmark from SPEC2017 shows a regression. The situation can be described this way:
```
func a {
  b(...);
  b(...);
  b(...);
  b(...);
}

func b {
  ...
}
```
Before running the SLP vectoriser, function `b` is considered too big to be inlined into `a`. After SLP, `b` gets inlined, which opens new opportunities for SLP vectorisation.
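A minimal sketch of the pattern in C, loosely modelled on x264's pixel functions (the names `sad_4x4`/`sad_8x8` and the block sizes are illustrative, not the exact x264 code): the straight-line body of the callee looks expensive to the inliner until SLP collapses it into a few vector operations, after which inlining all four calls becomes profitable and exposes further SLP opportunities in the caller.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callee ("func b"): sum of absolute differences over a
 * 4x4 block. Scalar cost is high; after SLP vectorisation the body
 * shrinks and may drop below the inline threshold. */
static int sad_4x4(const uint8_t *a, const uint8_t *b, int stride) {
    int sum = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            sum += abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

/* Hypothetical caller ("func a"): four calls to the same callee. */
int sad_8x8(const uint8_t *a, const uint8_t *b, int stride) {
    return sad_4x4(a, b, stride)
         + sad_4x4(a + 4, b + 4, stride)
         + sad_4x4(a + 4 * stride, b + 4 * stride, stride)
         + sad_4x4(a + 4 * stride + 4, b + 4 * stride + 4, stride);
}
```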
If I turn off vectorisation pre-LTO, there is no longer a second SLP run, and extra vectorisation opportunities are missed.
It can be fixed by adding an extra inlining and SLP vectorisation run near the end of the post-link LTO pipeline, but I am not sure whether that is an acceptable way to fix the regression.
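One way to prototype that suggestion without patching the pipeline is `opt`'s textual pass syntax on the merged post-link module (a sketch only; the file names are placeholders and the exact position of the extra passes within the LTO pipeline would need experimentation):

```shell
# Run the default post-link LTO pipeline, then one more round of
# inlining and SLP vectorisation, plus a cleanup, on the merged module.
opt -passes='lto<O3>,inline,slp-vectorizer,instcombine' merged.bc -o out.bc
```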
Review https://reviews.llvm.org/D148010 attempts to rewrite the LTO pass pipeline so that unrolling and loop vectorisation are not performed pre-LTO. From a performance point of view this is a large change and can lead to a lot of regressions. It probably makes more sense if we can simplify more during LTO, but it can also lead to phase-ordering issues.
To make the example above more concrete, it is this function from x264: https://github.com/mirror/x264/blob/eaa68fad9e5d201d42fde51665f2d137ae96baf0/common/pixel.c#L288, where doing SLP pre-LTO can lead to extra inlining as the costs are a lot lower: https://godbolt.org/z/4j8norofn
The other issue I have heard of from x264 is the quant_4x4 function captured in llvm/test/Transforms/PhaseOrdering/AArch64/quant_4x4.ll. From what I understand loop-vectorization needs to happen before some simplifications+unrolling, as SLP cannot yet handle the unrolled form.
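A hedged C sketch of the quant_4x4 pattern (simplified from the shape captured in the phase-ordering test, not the exact x264 source; the shift and table types are assumptions): the 16-iteration loop with a branchy body is something the loop vectoriser can turn into selects and vector arithmetic, but once fully unrolled into 16 separate branchy blocks, SLP cannot yet recover the vector form.

```c
#include <stdint.h>

/* Simplified quant_4x4-style kernel: quantise 16 DCT coefficients and
 * report whether any non-zero coefficient survives. The if/else body
 * becomes a select under if-conversion, which LoopVectorize handles;
 * the unrolled scalar form defeats SLP. */
int quant_4x4(int16_t dct[16], const uint16_t mf[16], const uint16_t bias[16]) {
    int nz = 0;
    for (int i = 0; i < 16; i++) {
        if (dct[i] > 0)
            dct[i] = (bias[i] + dct[i]) * mf[i] >> 16;
        else
            dct[i] = -((bias[i] - dct[i]) * mf[i] >> 16);
        nz |= dct[i];
    }
    return !!nz;
}
```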
There is a scenario where the pre-LTO LoopVectorize fails to vectorise code, and SLPVectorizer transforms some chunks of code within the loop. During link-time LTO, new opportunities after inlining might allow the loop to be vectorised, but SLP has already spoiled it. I don't see how this can be resolved optimally without changing the optimisation pipeline.
My real-life example is like this:
I can also imagine a scenario where small functions get inlined into a loop only after SLP has vectorised them.
GCC doesn't run vectorisation before LTO. I wonder if LLVM could do the same, and where the resistance against it comes from. Surely there will be some regressions, but they couldn't be as unsolvable as the one described above.