@siboehm regarding this manual inlining, how would one go about implementing it? There is:
```python
def _populate_forest_func(forest, root_func, tree_funcs, fblocksize):
    """Populate root function IR for forest"""
    assert fblocksize > 0
    # generate the setup-blocks upfront, so each instruction_block can be passed its successor
    instr_blocks = [
        (
            root_func.append_basic_block("instr-block-setup"),
            tree_funcs[i : i + fblocksize],
        )
        for i in range(0, len(tree_funcs), fblocksize)
    ]
    term_block = root_func.append_basic_block("term")
    ir.IRBuilder(term_block).ret_void()
    for i, (setup_block, tree_func_chunk) in enumerate(instr_blocks):
        next_block = instr_blocks[i + 1][0] if i < len(instr_blocks) - 1 else term_block
        eval_objective_func = next_block == term_block
        _populate_instruction_block(
            i,
            forest,
            root_func,
            tree_func_chunk,
            setup_block,
            next_block,
            eval_objective_func,
        )
```
Could one just "manually inline" `tree_funcs[i : i + fblocksize]`, or would you rather add a function `generate_inlined_tree_chunk(..., fblocksize: int)`?
I am also asking because compile times are a lot faster with `finline=False`, but do you think the time would be further reduced if the manually inlined IR were optimised and compiled directly?
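For context, a minimal sketch of the kind of call I mean (the model path is a placeholder, and I'm assuming the `compile()` keyword arguments as discussed here):

```python
import time

import lleaves

# "model.txt" is a placeholder for a LightGBM model file
llgbm = lleaves.Model(model_file="model.txt")

start = time.perf_counter()
# finline=False skips LLVM's inlining of the per-tree functions into the forest function
llgbm.compile(finline=False)
print(f"compile() took {time.perf_counter() - start:.1f}s")
```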
Not sure I understand the question. In the current state, the forest is a function (`forest_root`) and each tree is a function. The forest function calls the tree functions, and in the inlining pass the tree functions are inlined into the forest function. Instead, lleaves could generate the code for the trees directly into the body of the forest function, without creating a new function for each tree.
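As a toy llvmlite sketch of that idea (single-split stumps instead of real trees; none of these names come from lleaves):

```python
from llvmlite import ir

DOUBLE = ir.DoubleType()

# toy "trees": (feature threshold, left leaf value, right leaf value)
TOY_TREES = [(0.5, 1.0, 2.0), (1.5, -1.0, 3.0)]

module = ir.Module(name="forest")
# forest_root(double feature) -> double, loosely mirroring the forest function
root_func = ir.Function(module, ir.FunctionType(DOUBLE, [DOUBLE]), name="forest_root")
builder = ir.IRBuilder(root_func.append_basic_block("entry"))

feature = root_func.args[0]
result = ir.Constant(DOUBLE, 0.0)

for threshold, left, right in TOY_TREES:
    # Instead of creating an ir.Function per tree and calling it, the tree's
    # comparison / leaf selection is emitted straight into forest_root's blocks.
    is_left = builder.fcmp_ordered("<", feature, ir.Constant(DOUBLE, threshold))
    leaf = builder.select(is_left, ir.Constant(DOUBLE, left), ir.Constant(DOUBLE, right))
    result = builder.fadd(result, leaf)

builder.ret(result)
print(module)  # textual LLVM IR with no call instructions left to inline
```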
Actually, now that I'm thinking about it, the inlining is maybe a red herring. I'm not sure anymore how I ran the benchmarks I posted in the top comment. What may actually be happening is not that the inlining itself takes a lot of time, but that the inlining causes the forest function to become very large, which makes the compiler backend much slower at codegen. The way to test this would be to run the inlining pass separately via `opt` and compare the timings.
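One way that comparison could be set up (a sketch, not how the benchmarks in the top comment were produced; `forest.ll` is a placeholder name, and the `opt` pass syntax depends on the installed LLVM version):

```python
import subprocess
import time

from llvmlite import ir

module = ir.Module(name="forest")  # stand-in for the module lleaves generates
with open("forest.ll", "w") as f:
    f.write(str(module))  # llvmlite modules stringify to textual LLVM IR

start = time.perf_counter()
# run only the inliner; older LLVM builds use the legacy flag -inline instead
subprocess.run(
    ["opt", "-S", "-passes=inline", "forest.ll", "-o", "forest_inlined.ll"],
    check=True,
)
print(f"inlining alone took {time.perf_counter() - start:.2f}s")
```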
imo if you want the compilation to be faster on your Kubernetes cluster, then implementing more generic arch targeting is going to be much easier and a much more certain win than the inlining.
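For reference, a sketch of what more generic arch targeting could look like at the llvmlite level (this is not lleaves' current API; the triple and CPU strings are illustrative):

```python
import llvmlite.binding as llvm
from llvmlite import ir

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

ir_module = ir.Module(name="forest")  # stand-in for the lleaves-generated module
ir_module.triple = "x86_64-unknown-linux-gnu"

target = llvm.Target.from_triple(ir_module.triple)
# cpu="x86-64" is the generic baseline instead of the host CPU from
# llvm.get_host_cpu_name(), so a cached object stays portable across cluster nodes
target_machine = target.create_target_machine(cpu="x86-64", features="", opt=3)

mod_ref = llvm.parse_assembly(str(ir_module))
with open("forest.o", "wb") as f:
    f.write(target_machine.emit_object(mod_ref))
```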
Thanks a lot for sharing those insights and ideas. Let me try to explain my reasoning behind the question: I was thinking about how one would implement the manual inlining while reusing as much of the existing code as possible, and I was assuming that one would not inline more than `fblocksize` tree functions at once. Since each instruction block contains at most `fblocksize` tree functions, I thought that inlining all tree functions up-front would be incompatible with the batching used when generating the instruction blocks.
I am not familiar with `opt` and how it relates to `llvmlite`; would you be able to elaborate a bit on that?
I will definitely look into the generic arch targeting topic, thanks for the pointer.
I will implement them at some point myself or I'm also accepting PRs:
`tree_root` function. That function is enormous (after inlining every tree), so it takes a long time. For mtpl2 (XACT machine):
What's not clear from these timings is how long the optimization of the inlined function would take, e.g. if we had already inlined everything in the frontend.
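One way to measure that (a sketch using llvmlite's legacy pass-manager API; `forest_inlined.ll` is a placeholder for the already-inlined IR):

```python
import time

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

with open("forest_inlined.ll") as f:
    mod_ref = llvm.parse_assembly(f.read())

pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 3
pm = llvm.create_module_pass_manager()
pmb.populate(pm)

start = time.perf_counter()
pm.run(mod_ref)  # optimization only; backend codegen would be timed separately
print(f"optimizing the pre-inlined module took {time.perf_counter() - start:.1f}s")
```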
Ways to mitigate: