Open igogo-x86 opened 3 months ago
@davemgreen @nikic
After turning off pre-LTO vectorisation, the x264 benchmark from SPEC2017 shows a regression. The situation can be described this way:
```
func a {
  b(...);
  b(...);
  b(...);
  b(...);
}

func b {
  ...
}
```
Before running the SLP vectoriser, function `b` is considered too big to be inlined into `a`. After SLP, `b` gets inlined, which opens new opportunities for SLP vectorisation.
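A minimal sketch of the pattern in C, loosely modelled on x264's pixel functions (the names `sad_4x4`/`sad_8x8` and the block sizes are illustrative, not the exact x264 code): the straight-line body of the callee looks expensive to the inliner until SLP collapses it into a few vector operations, after which inlining all four calls becomes profitable and exposes further SLP opportunities in the caller.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callee ("func b"): sum of absolute differences over a
 * 4x4 block. Scalar cost is high; after SLP vectorisation the body
 * shrinks and may drop below the inline threshold. */
static int sad_4x4(const uint8_t *a, const uint8_t *b, int stride) {
    int sum = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            sum += abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

/* Hypothetical caller ("func a"): four calls to the same callee. */
int sad_8x8(const uint8_t *a, const uint8_t *b, int stride) {
    return sad_4x4(a, b, stride)
         + sad_4x4(a + 4, b + 4, stride)
         + sad_4x4(a + 4 * stride, b + 4 * stride, stride)
         + sad_4x4(a + 4 * stride + 4, b + 4 * stride + 4, stride);
}
```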
If I turn off vectorisation pre-LTO, there is no longer a second SLP run, and extra vectorisation opportunities are missed.
It can be fixed by adding an extra inlining and SLP vectorisation run near the end of the post-link LTO pipeline, but I am not sure whether that is an acceptable way to fix the regression.
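One way to prototype that suggestion without patching the pipeline is `opt`'s textual pass syntax on the merged post-link module (a sketch only; the file names are placeholders and the exact position of the extra passes within the LTO pipeline would need experimentation):

```shell
# Run the default post-link LTO pipeline, then one more round of
# inlining and SLP vectorisation, plus a cleanup, on the merged module.
opt -passes='lto<O3>,inline,slp-vectorizer,instcombine' merged.bc -o out.bc
```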
Review https://reviews.llvm.org/D148010 attempts to rewrite the LTO pass pipeline so that unrolling and loop vectorisation are not performed pre-LTO. From a performance point of view this is a large change and can lead to a lot of regressions. It probably makes more sense if we can simplify more during LTO, but it can also lead to phase-ordering issues.
To make the example above more concrete, it is this function from x264: https://github.com/mirror/x264/blob/eaa68fad9e5d201d42fde51665f2d137ae96baf0/common/pixel.c#L288, where doing SLP pre-LTO can lead to extra inlining as the costs are a lot lower: https://godbolt.org/z/4j8norofn
The other issue I have heard of from x264 is the quant_4x4 function captured in llvm/test/Transforms/PhaseOrdering/AArch64/quant_4x4.ll. From what I understand loop-vectorization needs to happen before some simplifications+unrolling, as SLP cannot yet handle the unrolled form.
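A hedged C sketch of the quant_4x4 pattern (simplified from the shape captured in the phase-ordering test, not the exact x264 source; the shift and table types are assumptions): the 16-iteration loop with a branchy body is something the loop vectoriser can turn into selects and vector arithmetic, but once fully unrolled into 16 separate branchy blocks, SLP cannot yet recover the vector form.

```c
#include <stdint.h>

/* Simplified quant_4x4-style kernel: quantise 16 DCT coefficients and
 * report whether any non-zero coefficient survives. The if/else body
 * becomes a select under if-conversion, which LoopVectorize handles;
 * the unrolled scalar form defeats SLP. */
int quant_4x4(int16_t dct[16], const uint16_t mf[16], const uint16_t bias[16]) {
    int nz = 0;
    for (int i = 0; i < 16; i++) {
        if (dct[i] > 0)
            dct[i] = (bias[i] + dct[i]) * mf[i] >> 16;
        else
            dct[i] = -((bias[i] - dct[i]) * mf[i] >> 16);
        nz |= dct[i];
    }
    return !!nz;
}
```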
There is a scenario where the pre-LTO LoopVectorize fails to vectorise code, and SLPVectorizer transforms some chunks of code within the loop. During link-time LTO, new opportunities after inlining might allow the loop to be vectorised, but SLP has already spoiled it. I don't see how this can be resolved optimally without changing the optimisation pipeline.
My real-life example is like this:
I can also imagine a scenario where small functions get inlined into a loop only after SLP has vectorised them.
GCC doesn't run vectorisation before LTO. I wonder if LLVM could do the same, and where the resistance against it comes from. Surely there will be some regressions, but they couldn't be as unsolvable as the one described above.