Open define-private-public opened 4 months ago
It is not clear to me if this is a bug on quick read through the blog. If you could reduce this and put in a godbolt example that would really help. Then we could more easily look at the AST and IR and see if anything obvious pops up.
This is not my area, CC @erichkeane
Without a good minimal example (and there isn't any actionable code on that blog post), I can't really see any reason why this would be the case. It isn't particularly reproducible unfortunately.
As far as I know/can tell, the only difference from the CFE's perspective is that we do a de-virtualization in the case of a 'final' class: https://godbolt.org/z/8q9qM91o5
See how the use of UF
in baz
does a direct call to UF::f
, rather than via a vtable load. SO, clang ended up devirtualizing, which, on its face, should be gain.
Even under opt(https://godbolt.org/z/xqefEE1cj) that is effectively the difference you see. UF
has its function called directly.
@nickdesaulniers is the one who commented on the original HN article, but he's away apparently, so unknown if he can comment. IF we can get some sort of reproducer, it would be nice to see some LLVM folks around to analyze.
MY only (CFE only-knowledge though) guesses are: 1- There is something wrong with the testing methodology. I have no way of knowing one way or the other here, but it is actually quite shocking to me that a: Devirtualization showed so little gain across two compilers, and b; that it showed much negative effects.
2- This hit some sort of pathological case, where the inlining of some now-not-virtual functions resulted in LLVM optimizing it in a way that is non-optimal.
So without some better reproducer, I don't really have an idea, nor even if we did, do I have the expertise here to help.
According to https://news.ycombinator.com/item?id=40156196 and https://gitlab.com/define-private-public/PSRayTracing/-/issues/85, it's related to uniform_real_distribution / logl not being inlined
I'm a little confused, is the actually that the use of final
is actually causing an issue with the uniform_real_distribution? I was thinking that they might be separate issues and Mr. Zhechev might have uncovered a separate issue aside from the A B testing of final
.
Is it possible that final
-> devirtualization -> unconditionally inlining more -> increased icache pressure -> performance drops?
We've seen a variety of performance issues in Chromium due to overly-aggressive inlining (leading to me asking our compiler team to look at tuning inlining heuristics if they can), but I don't know how plausible that is here.
Is it possible that final -> devirtualization -> unconditionally inlining more -> increased icache pressure -> performance drops?
I think it can be the case once they are turned into direct calls, indeed final
seems to aid devirtualization as per:
https://github.com/llvm/llvm-project/blob/d86cc73bbfd9a22d9a0d498d72c9b2ee235128e9/clang/lib/AST/DeclCXX.cpp#L2337-L2339
Out of curiosity, do you have some IR at hand (from Chromium) of overly-aggressive inlining that showed up to be detrimental for performance?
I don't have any offhand, but Chromium installs a custom clang plugin to force authors to move class constructors/destructors out-of-line for any non-tiny class, because of this problem; this is irritating in C++20 because then people can't use aggregates nearly as often. We are also looking at some binary size changes due to different inlining as part of https://issues.chromium.org/issues/40255003; see comments 15 and 16 for some LLVM IR that changed due to using __builtin_trap()
more and modifying inlining as a result. I don't know what the perf impact of the changes on that bug is, though (that one is being pursued to understand the binary size changes more than because we saw a perf hit).
According to https://news.ycombinator.com/item?id=40156196 and https://gitlab.com/define-private-public/PSRayTracing/-/issues/85, it's related to uniform_real_distribution / logl not being inlined
Yeah, looks like we're not folding calls to logl
with constants! EDIT: or any long double
libcalls!
Hi. I published this blog post last week Monday about the
final
keyword. I made a note of a slowdown that was found when using clang/LLVM to compile and run the code.On Hacker News discussion an LLVM developer commented:
I plan on writing a follow up post shortly since the discussion surrounding everything seemed to be a bit more than I anticipated; and I feel a lot of people got the intent of the article wrong.