siboehm / lleaves

Compiler for LightGBM gradient-boosted trees, based on LLVM. Speeds up prediction by ≥10x.
https://lleaves.readthedocs.io/en/latest/
MIT License

Only one CPU is used in prediction #26

Closed jiazou-bigdata closed 2 years ago

jiazou-bigdata commented 2 years ago

Hi, we installed lleaves via pip install lleaves. We found that prediction only utilizes one CPU core, even though we set n_jobs=8 and have 8 CPU cores available. This is inconsistent with the lleaves code here: https://github.com/siboehm/lleaves/blob/master/lleaves/lleaves.py#L140

Why would that happen? Any suggestions are highly appreciated.

siboehm commented 2 years ago

That sounds very strange; the multithreading should be very robust, as it's just a ThreadPoolExecutor from the standard Python library (see the relevant lleaves code). There is no locking internally in lleaves, and it drops the GIL upon calling into the compiled binary. I just checked with the most up-to-date dependencies and cannot reproduce this.
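
A minimal sketch of the pattern described above, not lleaves's actual code: the rows are split into chunks and each chunk is submitted to a standard-library ThreadPoolExecutor. The function called per chunk (`native_predict_fn` here is a placeholder) stands in for the GIL-releasing call into the compiled binary, which is what lets the workers run on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def predict_parallel(native_predict_fn, data, n_jobs=8):
    # native_predict_fn is a placeholder for the compiled prediction function;
    # it takes a 2D chunk of rows and returns a 1D array of predictions.
    # Split the row indices into contiguous chunks, one per worker.
    chunks = np.array_split(np.arange(len(data)), n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = pool.map(lambda idx: native_predict_fn(data[idx]), chunks)
    return np.concatenate(list(results))
```

Because the heavy work happens outside the GIL, threads (rather than processes) are enough to saturate the cores without copying the input data.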

What kind of hardware are you running it on? Are you passing enough data (n_rows >> n_jobs)? And are you certain you're actually measuring the prediction, rather than the compilation, which is always single-threaded and can take quite a long time?
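
A rough timing sketch that separates the one-off, single-threaded compilation from the multi-threaded prediction; the model path and feature count are placeholders:

```python
import time

import numpy as np
import lleaves

model = lleaves.Model(model_file="model.txt")

t0 = time.perf_counter()
model.compile()  # single-threaded; can dominate the runtime for large models
t1 = time.perf_counter()

X = np.random.rand(1_000_000, 30)  # 30 features is just an example
preds = model.predict(X, n_jobs=8)  # this is the part that uses multiple cores
t2 = time.perf_counter()

print(f"compile: {t1 - t0:.1f}s  predict: {t2 - t1:.1f}s")
```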

jiazou-bigdata commented 2 years ago

@siboehm

My bad, you are right: it was running the compilation step before the prediction. Compilation is fast for 10 trees, but it's very slow for 500 trees and 1600 trees. A perf profile of the compilation step shows:

40.80%  libLLVM-11.so  [.] llvm::LoopBase<llvm::MachineBasicBlock, llvm::MachineLoop>::getExi
25.49%  libLLVM-11.so  [.] llvm::SmallPtrSetImplBase::FindBucketFor
20.14%  libLLVM-11.so  [.] llvm::MachineInstr::isIdenticalTo
 7.93%  libLLVM-11.so  [.] llvm::MachineOperand::isIdenticalTo
 2.21%  libLLVM-11.so  [.] (anonymous namespace)::MachineLICMBase::HoistOutOfLoop

Any way to accelerate this step?

siboehm commented 2 years ago

Not right now, though this is something I want to address in the future. Compilation takes so long because the binary basically consists of a single enormous function (everything is inlined), which makes the optimization passes quite slow. The way to mitigate this would be to split the function into multiple compilation units (which would have to be done manually) and then compile them in parallel.

For now, all I can recommend is compiling the model once and caching the result (via lleaves_model.compile(cache=<some path>)).
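
A minimal sketch of the caching workflow (paths and data shapes are placeholders): the first run compiles and writes the compiled binary to the cache file, and later runs with the same cache path load it instead of recompiling.

```python
import numpy as np
import lleaves

model = lleaves.Model(model_file="lgbm_model.txt")
model.compile(cache="/tmp/lleaves_compiled.bin")  # slow only on the first run

X = np.random.rand(10_000, 30)  # stand-in for the real feature matrix
preds = model.predict(X, n_jobs=8)
```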

jiazou-bigdata commented 2 years ago

This is very helpful. Thank you!