llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.8k stars 11.91k forks source link

New compile time regressions (12 vs 13) on znver3 (LoopMicroOpBufferSize too large?) #50802

Closed davidbolvansky closed 5 months ago

davidbolvansky commented 3 years ago
Bugzilla Link 51460
Version trunk
OS All
CC @alinas,@LebedevRI,@preames,@RKSimon,@MattPD,@nikic,@rotateright,@tstellar

Extended Description

Phoronix reported some significant CT regressions for mplayer and ffmpeg

https://www.phoronix.com/scan.php?page=article&item=clang13-initial-epyc&num=4

LebedevRI commented 3 years ago

Thank you all for participating in this disscussion!

tstellar commented 3 years ago

Can we remove release-13.0.0 from the blocks field?

LebedevRI commented 3 years ago

Sorry for late response here, was on vacation.

I started writing a long comment on unrolling heuristics, realized it was only vaguely relevant here, and decided to save it to my public notes for ease of later use. See https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas. rst#unroll-heuristics if interested.

Specifically for this bug, I think it's unfortunate that our unrolling heuristic is too tightly tied to an architectural feature, but I don't think we should avoid correcting the feature value simply because it causes some regressions on that architecture. I would suggest we look at some workarounds to preserve the previous behavior while letting the flag be correct. A few ideas:

  • Cap the scheduler with a fixed upper bound to avoid complexity explosion with a link specifically to the bug to fix.
  • Introduce a "legacy feature for unrolling heuristic" TTI hook which used the old value on that platform, and defaults to the feature value on all other platforms.
    To be honest, i'm not quite convinced there is anything to fix wrt the threshold itself - unrolled loops are great, and as long as loop's body is less than 4K x86 uops (well, let's pretend that it maps 1:1 to 4K LLVM IR ops), then it fit within uop cache and we've also improved cpu dispatch performance.

So yes, if we unroll 100-instruction loop 40 times, and bloat it to 4000 ops, that's fine, ignoring the latent quadratic edge-cases it exposes in the rest of the compiler. Of course, if you care about binary size, perhaps you should use -Os/-Oz :]

But then yes, we probably should reevaluate the cases where the code is strictly sequential, like the contrived example from nikic, those are perhaps not worth unrolling, at least as much.

I think it's unreasonable to ask Roman to rewrite the core partial unrolling heuristic just because he happened to trip across a case where it's broken.

A more constructive approach is workaround it for the moment, and let someone interested with time and resources to tackle unrolling separately.

The thing is, the current threshold (512) is already a compromise/workaround. :)

preames commented 3 years ago

Sorry for late response here, was on vacation.

I started writing a long comment on unrolling heuristics, realized it was only vaguely relevant here, and decided to save it to my public notes for ease of later use. See https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas.rst#unroll-heuristics if interested.

Specifically for this bug, I think it's unfortunate that our unrolling heuristic is too tightly tied to an architectural feature, but I don't think we should avoid correcting the feature value simply because it causes some regressions on that architecture. I would suggest we look at some workarounds to preserve the previous behavior while letting the flag be correct. A few ideas:

I think it's unreasonable to ask Roman to rewrite the core partial unrolling heuristic just because he happened to trip across a case where it's broken. A more constructive approach is workaround it for the moment, and let someone interested with time and resources to tackle unrolling separately.

nikic commented 3 years ago

To be clear, my primary concern here is (for once!) not compile-time, but the fact that this simply generates terrible code. Just look at this: https://c.godbolt.org/z/617TjjKhs The compiler shouldn't be unrolling random loops to hundreds of iterations.

While I'd love to see issues like llvm/llvm-project#49928 resolved as a matter of general principle, they are only tangentially related to the problem at hand, which is way too aggressive unrolling. The overzealous unrolling must be fixed independently of any compile-time problems it may also be triggering.

LebedevRI commented 3 years ago

Thank you everyone for your thoughts. It is always fascinating to read what people think.

As a word of caution, i would strongly advise from using some random 3'rd party numbers to pass hard judgments.

Unfortunately, my position still hasn't changed, and this is still WONFIX. Although we can discuss temporarily reducing it iff someone personally promises to resolve llvm/llvm-project#49928 .

Even halving the LoopMicroOpBufferSize (512->256) results in a number of statistically-significant performance regressions (+ ~6%).

davidbolvansky commented 3 years ago

Shouldn't LoopMicroOpBufferSize for znver3 be reduced?

Yeah, there should be a more reasonable value because as wee see for reported projects, regressions are huge (and hardly with 30% better runtime perf to “justify” them).

Reopened, renamed a bit.

nikic commented 3 years ago

I don't think full unrolling is relevant here: LoopMicroOpBufferSize controls the PartialThreshold, which is used for partial and runtime unrolling.

Full unrolling is controlled by different thresholds and a more sophisticated cost model that takes into account how much simplification is expected from breaking the loop backedge. Fully unrolling a relatively large loop can still be beneficial if it results in significant simplification.

Partial and runtime unrolling have more tenuous profitability heuristics, and creating 4096 instruction loops because that's the micro-architectural loop buffer size is almost certainly not profitable. Runtime unrolling will at least limit to 8 unrolls by default, but partial unrolling will be happily unroll to the full threshold.

LebedevRI commented 3 years ago

I agree that cost/benefit model for full loop unrolling is broken. It seems we want both to aggressively fully unroll if we can, and not do that at the same time.

At the very least, i really don't think we should be doing full unrolling before inlining/vectorization, but IMHO we should not be fully unrolling at all, at least not unless we can tell that there will be other benefits than lack of said loop.

+Reames - thoughts on changing profitability heuristics for full unrolling?

nikic commented 3 years ago

Shouldn't LoopMicroOpBufferSize for znver3 be reduced? While it may be correct from a micro-architectural perspective, the way it is used by LLVM is clearly inappropriate: LoopMicroOpBufferSize should be an upper bound on how much we partially unroll, but what we seem to be doing right now is to just unroll maximally up to that size. That's okay if the buffer size is reasonably small, but not if it's very large. That results in massive code bloat for no benefit. While there are scalability problems in other passes, that seems like the part that is causing the immediate problem and should be addressed. As this value appears to only be used for this unrolling heuristics, I'd suggest dropping it to a more sensible value.

LebedevRI commented 3 years ago

This is a duplicate of llvm/llvm-project#49928 This is expected, and will not be addressed for a release.

This bug has been marked as a duplicate of bug llvm/llvm-project#49928

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

RKSimon commented 5 months ago

Closing this - the znver3/4 case was addressed by #91340 / 54e52aa5ebe68de122a3fe6064e0abef97f6b8e0

(The more general threshold limit at #67657 was reverted but should be re-committed at some point)