[NewPM] Inliner: Certain destructor calls are no longer inlined, preventing SROA, causing +10% runtime perf regression

LebedevRI commented 3 years ago


Bugzilla Link	50099
Version	trunk
OS	Linux
Attachments	unoptimized IR for reproduction
CC	@aeubanks,@davidbolvansky,@vns-mn,@jdoerfert,@rotateright,@sstefan1

Extended Description

I'm seeing a rather big +10% runtime perf regression from new-pm switch on a certain codepath.

I don't really have a small workable testcase yet, but there's a full self-contained reproducer attached.

Here's how we can see that the problem is there:

$ clang -O3 -o old.ll -S -emit-llvm orig.ll -flegacy-pass-manager $ clang -O3 -o new.ll -S -emit-llvm orig.ll $ llvm-diff old.ll new.ll <...> function @_ZN8rawspeed6BufferD2Ev exists only in right module <...> in function _ZN8rawspeed23PanasonicDecompressorV511ProxyStream10parseBlockEv: in block %entry:

%FirstSection = alloca %"class.rawspeed::Buffer", align 8 %SecondSection = alloca %"class.rawspeed::Buffer", align 8 %0 = bitcast %"class.rawspeed::Buffer" %FirstSection to i8 call void @llvm.lifetime.start.p0i8(i64 16, i8 nonnull %0) #17 tail call void @llvm.experimental.noalias.scope.decl(metadata !135) %pos.i = getelementptr inbounds %"class.rawspeed::PanasonicDecompressorV5::ProxyStream", %"class.rawspeed::PanasonicDecompressorV5::ProxyStream" %this, i64 0, i32 0, i32 1 %0 = load i32, i32* %pos.i, align 4, !tbaa !4, !noalias !9 tail call void @llvm.experimental.noalias.scope.decl(metadata !138)

<...>

Inliner/inline cost model isn't something i'm familiar with, so i'm posting this as is. Any suggestions how this can be approached from the LLVM side?

LebedevRI commented 3 years ago

Posted https://reviews.llvm.org/D101228 and https://reviews.llvm.org/D101229

LebedevRI commented 3 years ago

It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.

It is cold as per the (static) block frequency, because it's at the returns out of the function (well, because those are destructor calls), which are less frequent than the loops in function body.

I guess this bisects to https://reviews.llvm.org/D28331

How about not using isColdCallSite() iff there are arguments that are marked as SROA'ble?

aeubanks commented 3 years ago

It seems like the threshold is the issue, not the cost. You said that the threshold was lower because it's considered a cold call. If this is causing a 10% regression, then maybe the call shouldn't be considered cold and we should be using the higher threshold.

LebedevRI commented 3 years ago

At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this) Yep. There is a number of things that could be fixed at source level, however i think this might not be the best way forward for compiler regressions.

For visitExtractValue(), it could end with return Base::visitExtractValue(); instead of return false;, which would forward to visitInstruction() which does what you want.

Yep.

Any thoughts regarding having smaller CallPenalty for noreturn calls?

aeubanks commented 3 years ago

At the C++ source level, can you mark _ZN8rawspeed19alignedFreeConstPtrEPKv as noexcept? That would almost certainly fix this. (or compile with -fno-exceptions, but maybe that's beyond the scope of this)

For visitExtractValue(), it could end with return Base::visitExtractValue(); instead of return false;, which would forward to visitInstruction() which does what you want.

LebedevRI commented 3 years ago

Looked a bit more at InlineCost.cpp There is one obvious bug: CallAnalyzer::visitExtractValue() should start with: // Usually extractvalue is modelled as free. if (TTI.getUserCost(&I, TargetTransformInfo::TCK_SizeAndLatency) == TargetTransformInfo::TCC_Free) return true;

that almost gets us over the edge, now the cost is 50 instead of 55, with threshold being 45 (because the callsite is cold).

Now, what i wonder is, perhaps we should be discounting noreturn calls?

aeubanks commented 3 years ago

PruneEH is not in the new PM, it's supposed to be replaced by a combination of FuncAttrs + SimplifyCFG.

But even in the legacy PM, before being inlined away, _ZN8rawspeed6BufferD2Ev still has the invoke. Perhaps some marking some functions as noexcept is the real fix?

jdoerfert commented 3 years ago

It is not argument promotion, or any IPO pass, as far as I can tell. The functions in questions have external linkage.

My guess is the prune-eh and simplify-cfg are to blame. Diffing the statistics shows the following two are missing in the new pipeline:

5 simplifycfg - Number of invokes with empty resume blocks simplified into calls 2 prune-eh - Number of invokes removed

aeubanks commented 3 years ago

adding -mllvm -debug-only=inline shows that some of the calls to _ZN8rawspeed6BufferD2Ev are just barely not inlined.

Inlining (cost=55, threshold=250), Call:   call void @&#8203;_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(13) %ref.tmp2) #&#8203;22
NOT Inlining (cost=55, threshold=45), Call:   call void @&#8203;_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer"* nonnull dereferenceable(20) %29) #&#8203;22

Weird that the threshold goes down at some point. -flegacy-pass-manager shows that the threshold stays at 250. I'm also not super familiar with inline cost modeling, InlineCost.cpp is where that mostly lives.

Also, the function after optimizations isn't trivial; if the invoke were just a normal call it'd probably get inlined always.

define linkonce_odr hidden void @_ZN8rawspeed6BufferD2Ev(%"class.rawspeed::Buffer" nonnull dereferenceable(13) %this) unnamed_addr #4 comdat align 2 personality i8 bitcast (i32 (...) @__gxx_personality_v0 to i8) { entry: %isOwner = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer" %this, i64 0, i32 2 %0 = load i8, i8 %isOwner, align 4, !tbaa !12, !range !80 %tobool.not = icmp eq i8 %0, 0 br i1 %tobool.not, label %if.end, label %if.then

if.then: ; preds = %entry %data = getelementptr inbounds %"class.rawspeed::Buffer", %"class.rawspeed::Buffer" %this, i64 0, i32 0 %1 = load i8, i8* %data, align 8, !tbaa !4 invoke void @_ZN8rawspeed19alignedFreeConstPtrEPKv(i8 %1) to label %if.end unwind label %terminate.lpad

if.end: ; preds = %if.then, %entry ret void

terminate.lpad: ; preds = %if.then %2 = landingpad { i8, i32 } catch i8 null %3 = extractvalue { i8, i32 } %2, 0 tail call void @__clang_call_terminate(i8 %3) #24 unreachable }

llvm / llvm-project

[NewPM] Inliner: Certain destructor calls are no longer inlined, preventing SROA, causing +10% runtime perf regression #49443

Extended Description