siboehm / lleaves

Compiler for LightGBM gradient-boosted trees, based on LLVM. Speeds up prediction by ≥10x.
https://lleaves.readthedocs.io/en/latest/
MIT License

Ideas for how to make compile time faster without affecting performance #48

Open siboehm opened 1 year ago

siboehm commented 1 year ago

I'll implement these at some point myself, but I'm also accepting PRs:

  1. The main reason compilation requires so much memory & time is that most optimization passes run on the whole tree_root function. After every tree has been inlined, that function is enormous, so the passes take a long time (see the timing sketch after this list).
  2. The LLVM inlining pass itself is also pretty slow.
  3. The compilation process really is pretty straightforward, so lleaves should be bottlenecked by disk speed, not hogging the CPU (as it currently does).
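A quick way to check where the time goes (a sketch, assuming a recent llvmlite that exposes LLVM's pass timers via set_time_passes):

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Enable LLVM's -time-passes instrumentation before compiling.
llvm.set_time_passes(True)

# ... run the compilation under test here, e.g. lleaves' Model.compile() ...

# Print the accumulated per-pass timing report and reset the timers.
print(llvm.report_and_reset_timings())
```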

For mtpl2 (XACT machine):

What's not clear from these timings: How long would the optimization of the inlined function take? E.g. if we already inlined everything in the frontend.

Ways to mitigate:

  1. Better optimization pass selection: Figure out which other optimization pass(es) are affecting the runtime, by brute-force searching through each combination. Figure out how much those passes affect compile time. Either turn them on, or figure out how to implement them ourselves in the frontend.
  2. Speed up / avoid the inlining pass: Either generate the code already inlined, or write our own inlining pass that's much faster. The inlining is really straightforward here (no shared state etc.), so a custom inliner should be orders of magnitude faster. (Actually this could be a red herring; the inlining probably just makes the backend passes way slower. To be tested.)
  3. Generic arch targeting: Allow specifying the target architecture through the llvmlite interface so that people can compile for their most generic arch (see the sketch after this list).
  4. Group each instruction cache block into a function: see here
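A sketch of what generic arch targeting could look like. lleaves doesn't expose this option yet, so this only shows the underlying llvmlite calls such a feature would make:

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

target = llvm.Target.from_triple(llvm.get_process_triple())

# Default behavior: tune for the exact host CPU. The resulting binary may
# crash with SIGILL on older machines in a heterogeneous fleet.
host_tm = target.create_target_machine(cpu=llvm.get_host_cpu_name())

# Generic targeting: pick a baseline CPU (here plain x86-64) so the compiled
# model runs anywhere, trading away newer vector extensions.
generic_tm = target.create_target_machine(cpu="x86-64", features="")
```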
trendelkampschroer commented 8 months ago

@siboehm regarding this manual inlining: how would one go about implementing it? There is:


```python
from llvmlite import ir


def _populate_forest_func(forest, root_func, tree_funcs, fblocksize):
    """Populate root function IR for forest"""

    assert fblocksize > 0
    # generate the setup-blocks upfront, so each instruction_block can be passed its successor
    instr_blocks = [
        (
            root_func.append_basic_block("instr-block-setup"),
            tree_funcs[i : i + fblocksize],
        )
        for i in range(0, len(tree_funcs), fblocksize)
    ]
    term_block = root_func.append_basic_block("term")
    ir.IRBuilder(term_block).ret_void()
    for i, (setup_block, tree_func_chunk) in enumerate(instr_blocks):
        # the last chunk falls through to the terminator and evaluates the objective
        next_block = instr_blocks[i + 1][0] if i < len(instr_blocks) - 1 else term_block
        eval_objective_func = next_block == term_block
        _populate_instruction_block(
            i,
            forest,
            root_func,
            tree_func_chunk,
            setup_block,
            next_block,
            eval_objective_func,
        )
```

Could one just "manually inline" tree_funcs[i : i + fblocksize], or would you rather add a function generate_inlined_tree_chunk(..., fbsize: int)?

I am also asking because with finline=False compile times are a lot faster, but do you think the time would be reduced further if the manually inlined IR were optimised and compiled directly?
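(For reference, a minimal sketch of the flag being discussed; the model and cache paths are placeholders:)

```python
import lleaves

llvm_model = lleaves.Model(model_file="mtpl2.txt")  # placeholder path
# finline=False skips LLVM's inlining pass: much faster compilation,
# at the cost of potentially slower prediction.
llvm_model.compile(cache="mtpl2.so", finline=False)
```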

siboehm commented 8 months ago

Not sure I understand the question. In the current state, the forest is a function (forest_root) and each tree is a function; the forest function calls the tree functions, and in the inlining pass the tree functions are inlined into the forest function. Instead, lleaves could generate the code for the trees directly into the body of the forest function, rather than making a new function for each tree.
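A toy sketch of that "generate the code already inlined" idea (not lleaves code; the real tree emission lives in lleaves' codegen, here replaced by a hard-coded threshold/leaf triple per "tree"):

```python
from llvmlite import ir

dbl = ir.DoubleType()
module = ir.Module(name="forest")
forest_root = ir.Function(module, ir.FunctionType(dbl, [dbl]), name="forest_root")
builder = ir.IRBuilder(forest_root.append_basic_block("entry"))

x = forest_root.args[0]
acc = ir.Constant(dbl, 0.0)
# Emit each "tree" (a single split here) straight into forest_root's body,
# so no per-tree function and no inlining pass is ever needed.
for threshold, left_leaf, right_leaf in [(0.5, 1.0, 2.0), (1.5, 3.0, 4.0)]:
    is_left = builder.fcmp_ordered("<", x, ir.Constant(dbl, threshold))
    leaf = builder.select(is_left, ir.Constant(dbl, left_leaf), ir.Constant(dbl, right_leaf))
    acc = builder.fadd(acc, leaf)  # sum the tree outputs, as GBTs do
builder.ret(acc)

print(module)  # textual LLVM IR with everything already in one function
```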

Actually, now that I'm thinking about it, the inlining is maybe a red herring. I'm not sure anymore how I ran the benchmarks I posted in the top comment. What may actually be happening is not that the inlining takes a lot of time, but that the inlining causes the forest function to become very large, which makes the compiler backend much slower at codegen. The way to test this would be to:

  1. Either set up better profiling of the LLVM middle-end and backend passes to see whether this hypothesis is true,
  2. or use opt and compare:
    • compiling the LLVM IR generated by lleaves as-is, versus
    • taking the LLVM IR generated by lleaves, measuring how long it takes opt to inline all functions, and then measuring how long it takes to generate the binary from the opt output (see the sketch below).
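The same experiment, sketched via llvmlite's binding layer instead of the opt binary; ir_text is assumed to already hold the textual IR that lleaves generated:

```python
import time

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

mod = llvm.parse_assembly(ir_text)  # ir_text: lleaves' generated IR (assumed given)
mod.verify()

# Step 1: time the middle-end (including the inliner) on its own.
pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 3
pmb.inlining_threshold = 225  # make sure the inliner actually runs
pm = llvm.create_module_pass_manager()
pmb.populate(pm)
t0 = time.perf_counter()
pm.run(mod)
print("middle-end (incl. inlining):", time.perf_counter() - t0, "s")

# Step 2: time backend codegen on the (now huge, fully inlined) module.
tm = llvm.Target.from_default_triple().create_target_machine()
t0 = time.perf_counter()
obj = tm.emit_object(mod)
print("backend codegen:", time.perf_counter() - t0, "s")
```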

imo if you want the compilation to be faster on your Kubernetes cluster, then implementing more generic arch targeting is going to be much easier and a much more certain win than the inlining.

trendelkampschroer commented 8 months ago

Thanks a lot for sharing those insights and ideas. Let me try to explain the reasoning behind my question: I was thinking about how to implement the manual inlining while reusing as much of the existing code as possible, and I assumed one would not inline more than fblocksize tree functions at once. Since each instruction block contains at most fblocksize tree functions, I thought that inlining all tree functions up-front would be incompatible with the batching used when generating the instruction blocks.

I am not familiar with opt and how it relates to llvmlite; would you be able to elaborate a bit on that?

I will definitely look into the generic arch targeting topic, thanks for the pointer.