Slowdown from the `final` keyword

define-private-public commented 4 months ago

Hi. I published this blog post last week Monday about the final keyword. I made a note of a slowdown that was found when using clang/LLVM to compile and run the code.

On Hacker News discussion an LLVM developer commented:

As an LLVM developer, I really wish the author filed a bug report and waited for some analysis BEFORE publishing an article (that may never get amended) that recommends not using this keyword with clang for performance reasons. I suspect there's just a bug in clang.

Is this an actual bug?
Would someone be willing to provide analysis as to what's going on?

I plan on writing a follow up post shortly since the discussion surrounding everything seemed to be a bit more than I anticipated; and I feel a lot of people got the intent of the article wrong.

shafik commented 4 months ago

It is not clear to me if this is a bug on quick read through the blog. If you could reduce this and put in a godbolt example that would really help. Then we could more easily look at the AST and IR and see if anything obvious pops up.

This is not my area, CC @erichkeane

erichkeane commented 4 months ago

Without a good minimal example (and there isn't any actionable code on that blog post), I can't really see any reason why this would be the case. It isn't particularly reproducible unfortunately.

As far as I know/can tell, the only difference from the CFE's perspective is that we do a de-virtualization in the case of a 'final' class: https://godbolt.org/z/8q9qM91o5

See how the use of UF in baz does a direct call to UF::f, rather than via a vtable load. SO, clang ended up devirtualizing, which, on its face, should be gain.

Even under opt(https://godbolt.org/z/xqefEE1cj) that is effectively the difference you see. UF has its function called directly.

@nickdesaulniers is the one who commented on the original HN article, but he's away apparently, so unknown if he can comment. IF we can get some sort of reproducer, it would be nice to see some LLVM folks around to analyze.

MY only (CFE only-knowledge though) guesses are: 1- There is something wrong with the testing methodology. I have no way of knowing one way or the other here, but it is actually quite shocking to me that a: Devirtualization showed so little gain across two compilers, and b; that it showed much negative effects.

2- This hit some sort of pathological case, where the inlining of some now-not-virtual functions resulted in LLVM optimizing it in a way that is non-optimal.

So without some better reproducer, I don't really have an idea, nor even if we did, do I have the expertise here to help.

cor3ntin commented 4 months ago

According to https://news.ycombinator.com/item?id=40156196 and https://gitlab.com/define-private-public/PSRayTracing/-/issues/85, it's related to uniform_real_distribution / logl not being inlined

cor3ntin commented 4 months ago

It might be https://github.com/llvm/llvm-project/issues/19916

cor3ntin commented 4 months ago

https://github.com/llvm/llvm-project/issues/19916#issuecomment-2088521829

define-private-public commented 4 months ago

I'm a little confused, is the actually that the use of final is actually causing an issue with the uniform_real_distribution? I was thinking that they might be separate issues and Mr. Zhechev might have uncovered a separate issue aside from the A B testing of final.

pkasting commented 4 months ago

Is it possible that final -> devirtualization -> unconditionally inlining more -> increased icache pressure -> performance drops?

We've seen a variety of performance issues in Chromium due to overly-aggressive inlining (leading to me asking our compiler team to look at tuning inlining heuristics if they can), but I don't know how plausible that is here.

antoniofrighetto commented 4 months ago

Is it possible that final -> devirtualization -> unconditionally inlining more -> increased icache pressure -> performance drops?

I think it can be the case once they are turned into direct calls, indeed final seems to aid devirtualization as per: https://github.com/llvm/llvm-project/blob/d86cc73bbfd9a22d9a0d498d72c9b2ee235128e9/clang/lib/AST/DeclCXX.cpp#L2337-L2339

Out of curiosity, do you have some IR at hand (from Chromium) of overly-aggressive inlining that showed up to be detrimental for performance?

pkasting commented 3 months ago

I don't have any offhand, but Chromium installs a custom clang plugin to force authors to move class constructors/destructors out-of-line for any non-tiny class, because of this problem; this is irritating in C++20 because then people can't use aggregates nearly as often. We are also looking at some binary size changes due to different inlining as part of https://issues.chromium.org/issues/40255003; see comments 15 and 16 for some LLVM IR that changed due to using __builtin_trap() more and modifying inlining as a result. I don't know what the perf impact of the changes on that bug is, though (that one is being pursued to understand the binary size changes more than because we saw a perf hit).

nickdesaulniers commented 3 months ago

According to https://news.ycombinator.com/item?id=40156196 and https://gitlab.com/define-private-public/PSRayTracing/-/issues/85, it's related to uniform_real_distribution / logl not being inlined

Yeah, looks like we're not folding calls to logl with constants! EDIT: or any long double libcalls!

llvm / llvm-project

Slowdown from the `final` keyword #90685