siboehm / lleaves

Compiler for LightGBM gradient-boosted trees, based on LLVM. Speeds up prediction by ≥10x.
https://lleaves.readthedocs.io/en/latest/
MIT License
364 stars 29 forks source link

Instruction-parallel multithreading #3

Open siboehm opened 3 years ago

siboehm commented 3 years ago

Right now parallelization is implemented in a straight-forward way in lleaves by partitioning the input-data across threads. Instead, each thread should predict across the whole input-data, but only across a subset of the trees.

Example (100 Trees in forest, 2 threads): Thread 1:

for row_id in range(len(input_data)):
  for tree in trees[0:50]:
    result[row_id] += tree(data[row_id]))
global_result += result

Thread 2:

for row_id in range(len(input_data)):
  for tree in trees[50:100]:
    result[row_id] += tree(data[row_id]))
global_result += result

Ideally this would keep each thread's trees fully in L1i-cache, resulting in super-linear speedups with enough cores, instead of the current linear speedups. Benchmarking is necessary to test how large the overhead from having to use atomic-adds would be.

Caveats: