Open LebedevRI opened 3 years ago
> It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.
It is cold per the (static) block frequency, because the call sits at the returns out of the function (well, because those are destructor calls), which are less frequent than the loops in the function body.
I guess this bisects to https://reviews.llvm.org/D28331
How about not using isColdCallSite() iff there are arguments that are marked as SROA-able?
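A toy model of that suggestion (this is illustrative, not LLVM's actual InlineCost.cpp code; the struct, function names, and the 250/45 values are assumptions, though 45 matches the cold-call-site threshold seen in the debug log below): only drop to the cold threshold when inlining would not unlock SROA of any argument.

```cpp
#include <algorithm>
#include <cassert>

// Toy stand-in for a call site in the inline cost model (not LLVM's types).
struct Site {
  bool IsCold;          // what isColdCallSite() would report
  bool HasSROAableArgs; // whether inlining could SROA any argument
};

constexpr int DefaultThreshold = 250;     // illustrative value
constexpr int ColdCallSiteThreshold = 45; // matches the log in this report

// The proposed tweak: a cold call site keeps the higher threshold if
// inlining it would enable SROA of the passed arguments.
int thresholdFor(const Site &S) {
  if (S.IsCold && !S.HasSROAableArgs)
    return std::min(DefaultThreshold, ColdCallSiteThreshold);
  return DefaultThreshold;
}
```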
It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.
> At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this)

Yep. There are a number of things that could be fixed at the source level; however, I think this might not be the best way forward for compiler regressions.
> For visitExtractValue(), it could end with `return Base::visitExtractValue();` instead of `return false;`, which would forward to `visitInstruction()`, which does what you want.
Yep.
Any thoughts regarding having a smaller CallPenalty for noreturn calls?
At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this)
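For illustration, a minimal sketch of what that source-level fix would look like. The function body and the `Buffer` stand-in here are simplified assumptions (only the mangled name `_ZN8rawspeed19alignedFreeConstPtrEPKv`, i.e. `rawspeed::alignedFreeConstPtr(const void*)`, comes from the report): once the helper is `noexcept`, nothing can unwind through it, so the front-end can emit a plain `call` in the destructor instead of an `invoke` with a landing pad.

```cpp
#include <cassert>
#include <cstdlib>

namespace rawspeed {
// Assumed shape of the helper behind _ZN8rawspeed19alignedFreeConstPtrEPKv.
// Marking it noexcept means no exception can propagate out of it.
void alignedFreeConstPtr(const void *p) noexcept {
  std::free(const_cast<void *>(p));
}

// Simplified stand-in for rawspeed::Buffer (not the real class).
struct Buffer {
  const void *data = nullptr;
  bool isOwner = false;
  ~Buffer() {
    if (isOwner)
      alignedFreeConstPtr(data); // plain `call` in IR, no landing pad needed
  }
};
} // namespace rawspeed
```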
For visitExtractValue(), it could end with `return Base::visitExtractValue();` instead of `return false;`, which would forward to `visitInstruction()`, which does what you want.
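The forwarding being suggested is the standard CRTP InstVisitor idiom. A minimal self-contained sketch (toy types, not LLVM's InstVisitor): returning through the base class falls back to `visitInstruction()`, so the generic accounting runs instead of being skipped by a hardcoded `return false;`.

```cpp
#include <cassert>

struct Inst {}; // toy stand-in for llvm::Instruction

// CRTP base: the default ExtractValue handler forwards to the derived
// class's generic visitInstruction().
template <typename SubClass> struct InstVisitorBase {
  bool visitExtractValue(Inst &I) {
    return static_cast<SubClass *>(this)->visitInstruction(I);
  }
  bool visitInstruction(Inst &) { return false; } // default fallback
};

struct CostAnalyzer : InstVisitorBase<CostAnalyzer> {
  using Base = InstVisitorBase<CostAnalyzer>;
  int GenericCost = 0;

  bool visitExtractValue(Inst &I) {
    // ... instruction-specific checks would go here ...
    return Base::visitExtractValue(I); // not `return false;`
  }
  bool visitInstruction(Inst &) {
    ++GenericCost; // generic accounting a hardcoded `false` would skip
    return false;
  }
};
```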
Looked a bit more at InlineCost.cpp. There is one obvious bug: `CallAnalyzer::visitExtractValue()` should start with:

```cpp
// Usually, extractvalue is modelled as free.
if (TTI.getUserCost(&I, TargetTransformInfo::TCK_SizeAndLatency) ==
    TargetTransformInfo::TCC_Free)
  return true;
```
That almost gets us over the edge: now the cost is 50 instead of 55, with the threshold being 45 (because the call site is cold).
Now, what I wonder is: perhaps we should be discounting noreturn calls?
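As a toy illustration of the question (the function, the discount factor, and the exact penalty value are assumptions, not LLVM's code; the right discount is exactly what's up for debate): a call to a noreturn callee ends the path, so arguably it should be charged less than the full call penalty when costing a candidate for inlining.

```cpp
#include <cassert>

// Assumed to mirror LLVM's default InlineConstants::CallPenalty.
constexpr int CallPenalty = 25;

// Hypothetical discount for calls that never return control.
int penaltyFor(bool CalleeIsNoReturn) {
  return CalleeIsNoReturn ? CallPenalty / 2 : CallPenalty;
}
```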
PruneEH is not in the new PM; it's supposed to be replaced by a combination of FuncAttrs + SimplifyCFG.
But even in the legacy PM, before being inlined away, _ZN8rawspeed6BufferD2Ev still has the invoke. Perhaps marking some functions as noexcept is the real fix?
It is not argument promotion, or any IPO pass, as far as I can tell. The functions in question have external linkage.
My guess is that prune-eh and simplify-cfg are to blame. Diffing the statistics shows the following two are missing in the new pipeline:

```
5 simplifycfg - Number of invokes with empty resume blocks simplified into calls
2 prune-eh    - Number of invokes removed
```
Adding `-mllvm -debug-only=inline` shows that some of the calls to _ZN8rawspeed6BufferD2Ev are just barely not inlined:

```
Inlining (cost=55, threshold=250), Call:   call void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(13) %ref.tmp2) #22
NOT Inlining (cost=55, threshold=45), Call:   call void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(20) %29) #22
```
Weird that the threshold goes down at some point; with `-flegacy-pass-manager` the threshold stays at 250. I'm also not super familiar with inline cost modeling — InlineCost.cpp is where that mostly lives.
Also, the function after optimizations isn't trivial; if the invoke were just a normal call, it'd probably always get inlined.
```llvm
define linkonce_odr hidden void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(13) %this) unnamed_addr #4 comdat align 2 personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*) {
entry:
  %isOwner = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer"* %this, i64 0, i32 2
  %0 = load i8, i8* %isOwner, align 4, !tbaa !12, !range !80
  %tobool.not = icmp eq i8 %0, 0
  br i1 %tobool.not, label %if.end, label %if.then

if.then:                                          ; preds = %entry
  %data = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer"* %this, i64 0, i32 0
  %1 = load i8*, i8** %data, align 8, !tbaa !4
  invoke void @_ZN8rawspeed19alignedFreeConstPtrEPKv(i8* %1)
          to label %if.end unwind label %terminate.lpad

if.end:                                           ; preds = %if.then, %entry
  ret void

terminate.lpad:                                   ; preds = %if.then
  %2 = landingpad { i8*, i32 }
          catch i8* null
  %3 = extractvalue { i8*, i32 } %2, 0
  tail call void @__clang_call_terminate(i8* %3) #24
  unreachable
}
```
Extended Description
I'm seeing a rather big +10% runtime perf regression from the new-PM switch on a certain codepath.
I don't really have a small workable testcase yet, but there's a full self-contained reproducer attached.
Here's how we can see that the problem is there:
```
$ clang -O3 -o old.ll -S -emit-llvm orig.ll -flegacy-pass-manager
$ clang -O3 -o new.ll -S -emit-llvm orig.ll
$ llvm-diff old.ll new.ll
<...>
function @_ZN8rawspeed6BufferD2Ev exists only in right module
<...>
in function _ZN8rawspeed23PanasonicDecompressorV511ProxyStream10parseBlockEv:
  in block %entry:
<...>
```
The inliner/inline cost model isn't something I'm familiar with, so I'm posting this as-is. Any suggestions on how this could be approached from the LLVM side?