Open vedantk opened 6 years ago
mentioned in issue llvm/llvm-bugzilla-archive#39917
I understand you are not planning to land it now, but anyway I'm curious: Would your patch using ModulePass work when we compile files separately and link them later using linker?
Yes, I think so. If eh_typetable_adopt isn't used, then all of the information needed to perform ISel is fully contained within a single function.
If eh_typetable_adopt is used, there is a restriction. The calling function must have access to the definition of the function which donates its EH type table. In practice this seems to be the case.
I understand you are not planning to land it now, but anyway I'm curious: Would your patch using ModulePass work when we compile files separately and link them later using linker?
I prototyped a ModulePass which lifts EH type tables into IR, and taught CodeExtractor how to insert intrinsics into extracted functions s.t type tables are shared.
The idea is basically what Heejin described in llvm/llvm-bugzilla-archive#39917 , and what John alluded to here: 1) declare the EH type tables in IR, and 2) introduce an intrinsic which allows functions to share type tables. A module pass is needed to identify connected components of functions which all must share the same type tables.
I'll attach the patches here for posterity. Ultimately, what I found is that this does not substantially improve the efficacy of hot/cold splitting. On our stack, it's possible to achieve 98% of the results without actually splitting out landingpads / eh_typeid_for. What's more, the code size growth imposed by shared type tables is greater than the cost of simply treating eh_typeid_for as an input to the extraction region.
In light of this data, I think 'later' is a reasonable resolution for this bug: we can resurrect this idea if extracting landingpads at the IR-level becomes important down the road.
Bug llvm/llvm-bugzilla-archive#39917 has been marked as a duplicate of this bug.
Haven’t had a chance to experiment yet, but here’s another simple solution with only slightly worse codegen: hoist the call to eh.typeid.for and pass it in.
I've disabled extracting calls to eh_typeid_for in r346256 to address the immediate miscompile.
IMO it'd be ideal if we could find a simple way of sharing type info -> ID mappings in gcc_except_tab without regressing its size. Would that be possible if we used DW_EH_PE_absptr as the TType encoding? I'll try it out and report back; if it doesn't pan out, I'll try one of John or Reid's suggestions.
Could we specify an outlined unwind target in IR (say, by adding
unwind function
syntax to theinvoke
instruction)? For code that needs to branch back to non-EH blocks, we might use blockaddress+indirectbr.
That would be a pretty major change to IR, but yeah, if you wanted to (1) outline at the IR level and (2) share an outlined landing pad across functions, you would need something like this. Such a landing pad would have to be something without edges back into the main function body, though.
Makes sense. Doing this increases the aggregate size of gcc_except_tab sections across *.o's in the test-suite by 7.1% (or 39.5 KB). That's not too bad compared to the aggregate size of text (41.8 MB). Can we land that change as a bug fix until we implement something like @llvm.eh.typeid.declare?
The trade-off seems reasonable, but the engineering approach is not. We should not be landing "short-term" changes that regress general code-gen for the benefit of an experimental pass. You should just not outline these intrinsic calls, and if you want the benefits of being able to outline them, you should implement the IR features to make that possible.
Actually, if every function has the same type info -> ID mapping, wouldn't it be better to share the mappings? I suppose that would require changing the format for __gcc_except_tab and updating things that read it. That might be more involved than adding a new intrinsic, but could possibly extra code size savings.
I would love to make a new personality function to optimize code size, but the type info table is probably the least important component of that.
So, from a high level, the best way to "outline" this — to achieve better code density/locality of the non-EH code — is almost certainly just to emit all the code only reachable via landing pads in a separate top-level object which can be placed far away from the normal function. That would, of course, have to be done in the code generator, not as an IR outliner, so it's not directly applicable to your question here.
Why is this out of scope for an IR outliner? With a small linker change, it's possible to order outlined functions towards the end of the __text section.
I was suggesting that the landing pads themselves be moved into the other "function", so that the EH table would tell the unwinder to just resume directly in that function; that's not structurally expressible in IR. Also, EH code sometimes needs to be able to branch back to blocks that are reachable from non-EH blocks, e.g. if there's a catch.
That's a good point. The IR outliner gets some mileage out of extracting blocks dominated by a landing pad, but it could do more.
Could we specify an outlined unwind target in IR (say, by adding unwind function
syntax to the invoke
instruction)? For code that needs to branch back to non-EH blocks, we might use blockaddress+indirectbr.
I see two reasonable approaches for this, both of which require extra work during outlining:
- Adopt Reid's @llvm.eh.outlined.typeid.for idea, which I think would require typeid assignments to be recorded globally.
- Add some sort of @llvm.eh.typeid.declare intrinsic or attribute which assigns specific integer values to specific types and then replaces @llvm.eh.typeid.for calls with the assigned constant.
Thanks for these suggestions!
I tried a different approach, and I'm wondering now whether it's too naive. I made the type info -> type ID mapping global (i.e. shared via MachineModuleInfo). This fixes the bug, but I wonder -- would that bloat the EH tables too much?
Certainly in theory, yeah. A specific function's EH table would only need to contain entries up to the last type it actually uses, but you could end up with a function that has to include dozens of unnecessary entries just because they appear elsewhere in the module.
Makes sense. Doing this increases the aggregate size of gcc_except_tab sections across *.o's in the test-suite by 7.1% (or 39.5 KB). That's not too bad compared to the aggregate size of text (41.8 MB). Can we land that change as a bug fix until we implement something like @llvm.eh.typeid.declare?
Actually, if every function has the same type info -> ID mapping, wouldn't it be better to share the mappings? I suppose that would require changing the format for __gcc_except_tab and updating things that read it. That might be more involved than adding a new intrinsic, but could possibly extra code size savings.
So, from a high level, the best way to "outline" this — to achieve better code density/locality of the non-EH code — is almost certainly just to emit all the code only reachable via landing pads in a separate top-level object which can be placed far away from the normal function. That would, of course, have to be done in the code generator, not as an IR outliner, so it's not directly applicable to your question here.
Why is this out of scope for an IR outliner? With a small linker change, it's possible to order outlined functions towards the end of the __text section.
I was suggesting that the landing pads themselves be moved into the other "function", so that the EH table would tell the unwinder to just resume directly in that function; that's not structurally expressible in IR. Also, EH code sometimes needs to be able to branch back to blocks that are reachable from non-EH blocks, e.g. if there's a catch.
I see two reasonable approaches for this, both of which require extra work during outlining:
- Adopt Reid's @llvm.eh.outlined.typeid.for idea, which I think would require typeid assignments to be recorded globally.
- Add some sort of @llvm.eh.typeid.declare intrinsic or attribute which assigns specific integer values to specific types and then replaces @llvm.eh.typeid.for calls with the assigned constant.
Thanks for these suggestions!
I tried a different approach, and I'm wondering now whether it's too naive. I made the type info -> type ID mapping global (i.e. shared via MachineModuleInfo). This fixes the bug, but I wonder -- would that bloat the EH tables too much?
Certainly in theory, yeah. A specific function's EH table would only need to contain entries up to the last type it actually uses, but you could end up with a function that has to include dozens of unnecessary entries just because they appear elsewhere in the module.
This is IR-level outlining? And you're outlining part of the landing pad, but not the original call?
¥es.
So, from a high level, the best way to "outline" this — to achieve better code density/locality of the non-EH code — is almost certainly just to emit all the code only reachable via landing pads in a separate top-level object which can be placed far away from the normal function. That would, of course, have to be done in the code generator, not as an IR outliner, so it's not directly applicable to your question here.
Why is this out of scope for an IR outliner? With a small linker change, it's possible to order outlined functions towards the end of the __text section.
I see two reasonable approaches for this, both of which require extra work during outlining:
- Adopt Reid's @llvm.eh.outlined.typeid.for idea, which I think would require typeid assignments to be recorded globally.
- Add some sort of @llvm.eh.typeid.declare intrinsic or attribute which assigns specific integer values to specific types and then replaces @llvm.eh.typeid.for calls with the assigned constant.
Thanks for these suggestions!
I tried a different approach, and I'm wondering now whether it's too naive. I made the type info -> type ID mapping global (i.e. shared via MachineModuleInfo). This fixes the bug, but I wonder -- would that bloat the EH tables too much?
This is IR-level outlining? And you're outlining part of the landing pad, but not the original call?
So, from a high level, the best way to "outline" this — to achieve better code density/locality of the non-EH code — is almost certainly just to emit all the code only reachable via landing pads in a separate top-level object which can be placed far away from the normal function. That would, of course, have to be done in the code generator, not as an IR outliner, so it's not directly applicable to your question here.
I see two reasonable approaches for this, both of which require extra work during outlining:
The second idea is probably cleaner. The big disadvantage is that it's totally incompatible with further inlining. The advantage is that the assignment can happen in its own pass, which would allow ordinary intraprocedural optimizations to be run after it, such as switch formation. The outliner then only has to (1) copy the assignment into the outlined function and (2) refuse to outline @llvm.eh.typeid.for calls if for some reason they still exist.
I'm starting to suspect that the underlying bug here is in the lowering of @llvm.eh.typeid.for.
In SelectionDAG, we look up type id's within a MachineFunction:
...
And MachineFunction gives us the next available ID unless it's seen the type info already: ... But, I think this breaks down when you enable outlining. We're passing a type id retrieved from a landingpad in a parent function, to an outlined thunk:
Yes, as designed, landingpad + llvm.eh.typeid.for aren't going to work in the presence of outlining. They were designed with the goal of making inlining easy, and I guess outlining was out of scope. :) llvm.eh.typeid.for produces a selector index into a table in the LSDA of the function it is contained in. Moving it to a different function will give you an index into a different catch info table.
John, any ideas for how to address this?
Here's my (probably bad) idea. Make a new intrinsic, llvm.eh.outlined.typeid.for, that takes two parameters: the function whose LSDA we want to get the selector index from, and the typeinfo whose index we want. So: call i32 @llvm.eh.outlined.typeid.for(i8 bitcast ... @main, i8 bitcast ... @_ZTI3foo)
Right now we lower these intrinsics to plain immediate i32s. This new intrinsic would lower to materialize an MCSymbol into an i32, kind of like we do for llvm.localescape / llvm.localrecover, but more efficient since we promise it's an i32. We'd try to generate: MOV32ri $.Leh_selector_main_ZTI3foo, %EDX
And when we compile main, we'd define the MCSymbol so the assembler can fix it up for us later: .Leh_selector_main_ZTI3foo = 3 # or .set depending on your assembly dialect
At least, that was my preferred way to pass information from one MachineFunction compilation to the next: by abusing MC, since the assembler is the last whole-module thing that LLVM does.
Another idea would be to emit a table of i32s on the side that does translation from the selectors of one function to the selectors of any other. We'd emit them at the end of the module after we've codegened all instances of this intrinsic.
I'm starting to suspect that the underlying bug here is in the lowering of @llvm.eh.typeid.for.
In SelectionDAG, we look up type id's within a MachineFunction:
case Intrinsic::eh_typeid_for: {
// Find the type id for the given typeinfo.
GlobalValue *GV = ExtractTypeInfo(I.getArgOperand(0));
unsigned TypeID = DAG.getMachineFunction().getTypeIDFor(GV);
Res = DAG.getConstant(TypeID, sdl, MVT::i32);
setValue(&I, Res);
return nullptr;
}
And MachineFunction gives us the next available ID unless it's seen the type info already:
unsigned MachineFunction::getTypeIDFor(const GlobalValue *TI) {
for (unsigned i = 0, N = TypeInfos.size(); i != N; ++i)
if (TypeInfos[i] == TI) return i + 1;
TypeInfos.push_back(TI);
return TypeInfos.size();
}
But, I think this breaks down when you enable outlining. We're passing a type id retrieved from a landingpad in a parent function, to an outlined thunk:
%3 = tail call i32 @​llvm.eh.typeid.for(i8* bitcast ({ i8*, i8* }* @​_ZTI3foo to i8*)) #​11
...
call void @​main.cold.1(i32 %2, i8* %1, i32 %i.056) #​8
The outlined thunk tries to compare a passed-in type id to one synthesized within the outlined code:
define internal void @​main.cold.1(i32, i8*, i32 %i.056) #​7 personality i8* bitcast (i32 (...)* @​__gxx_personality_v0 to i8*) {
newFuncRoot:
%2 = tail call i32 @​llvm.eh.typeid.for(i8* bitcast (i8** @​_ZTIi to i8*)) #​11
%matches1 = icmp eq i32 %0, %2
...
At which point it's comparing the wrong ids, because the outlined thunk's MachineFunction has an incompatible TypeInfos list. I think the fix here would be to ensure that an outlined thunk has the same type info -> ID map as its parent.
Full disclosure -- I've been looking at a failure in SingleSource/Regression/C++/EH/throw_rethrow_test.cpp to debug this. It's the only other test-suite failure with D53887 applied, and it's simpler to understand.
The reference output is:
0: 1
1: 1
2: 1
3: 2
4: 2
5: 2
6: 3
7: 3
8: 3
exit 0
With D53887, the actual output is:
0: 2
1: 2
2: 2
3: 1
4: 1
5: 1
6: 3
7: 3
8: 3
exit 0
Looking at the IR I think Code Extractor (CE) isn't handling Multiple Exit Regions correctly.
I missed this. What do you think looks suspect?
Do we have examples/testcases for ME regions which work?
Here's one: https://github.com/llvm-mirror/llvm/blob/master/test/Transforms/HotColdSplit/multiple-exits.ll
How does CE handle branches outside the region, for e.g, does it insert return statements properly?
Skimming emitCallAndSwitchStatement, it looks like CE finds all exits which are outside of the to-be-extracted region and creates stub blocks for them in the outlined function. Branches are rewritten to jump to the right exit stub, and the exit stub returns a number which the caller can switch on. There's some logic to simplify/eliminate the switch in simple cases.
Also, how about the side-effects (stores etc.); for ME the side-effects also flow from multiple paths and CE may not be handling those correctly.
I'm not sure whether there's an issue here, but fwiw, I'm seeing the incorrect results from single-exit outlined regions containing __cxa_throw.
Looking at the IR I think Code Extractor (CE) isn't handling Multiple Exit Regions correctly. Do we have examples/testcases for ME regions which work? How does CE handle branches outside the region, for e.g, does it insert return statements properly? Also, how about the side-effects (stores etc.); for ME the side-effects also flow from multiple paths and CE may not be handling those correctly.
assigned to @vedantk
Adding the patches posted by Vedant as text files.
class_hierarchy-with-hot-cold-splitting.ll.txt class_hierarchy-without-hot-cold-splitting.ll.txt
0001-WIP-ADT-Add-getArrayRef-to-UniqueVector.patch.txt 0002-WIP-EH-Explicitly-declare-type-tables-in-IR.patch.txt 0003-WIP-EH-Share-type-tables-when-outlining.patch.txt
Duplicate of: https://github.com/llvm/llvm-project/issues/39264
Other issue is marked as duplicate of this one.
Adding attachments from: https://github.com/llvm/llvm-project/issues/39264 as this is an active bug.
no-outlining.ll.txt with-outlining.bad.S.txt with-outlining.good.S.txt with-outlining.ll.txt
Extended Description
When running this test program with
-hot-cold-split=true
(with https://reviews.llvm.org/D53887 applied), we’re supposed to see this output:Instead, we see this:
Here’s where we crash:
Regression-C++-class_hierarchy`main.cold.2: 0x100001d47 <+0>: pushq %rbp 0x100001d48 <+1>: movq %rsp, %rbp 0x100001d4b <+4>: pushq %rbx 0x100001d4c <+5>: pushq %rax 0x100001d4d <+6>: cmpl $0x1, %edi 0x100001d50 <+9>: jne 0x100001d6c ; <+37> 0x100001d52 <+11>: movq %rsi, %rdi 0x100001d55 <+14>: callq 0x100001dcc ; symbol stub for: __cxa_begin_catch 0x100001d5a <+19>: movb 0x8(%rax), %cl 0x100001d5d <+22>: addb $0x30, %cl 0x100001d60 <+25>: movq 0x10(%rax), %rdx -> 0x100001d64 <+29>: movb %cl, (%rdx)
In the outlined catch path, we pass 0x101 (why?) to __cxa_begin_catch:
(lldb) reg read $rdi rdi = 0x0000000000000101
__cxa_begin_catch then returns a pointer to:
(lldb) x/8 $rax 0x100202e60: 0x00001f04 0x00000001 0x00000000 0x00000000 0x100202e70: 0x00000000 0x00000000 0x00000000 0x00000000
Clearly [$rax+16] is null, and we crash when we attempt to dereference that. But something looks like it's gone wrong by the point __cxa_begin_catch is called.
Looking at the IR before and after outlining, the issue isn't immediately jumping out at me. I've filed this because I'm worried this isn't just an issue with D53887, i.e. that there might be a more general problem with how CodeExtractor handles landingpads. I've attached the IR here.