Right now, parallelization in lleaves is implemented in a straightforward way by partitioning the input data across threads. Instead, each thread should predict across the whole input data, but only across a subset of the trees.
Example (100 trees in the forest, 2 threads):

Thread 1:
    for row_id in range(len(input_data)):
        for tree in trees[0:50]:
            result[row_id] += tree(input_data[row_id])
    global_result += result

Thread 2:
    for row_id in range(len(input_data)):
        for tree in trees[50:100]:
            result[row_id] += tree(input_data[row_id])
    global_result += result
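The scheme above can be sketched in plain Python. This is not the lleaves implementation, just an illustration of tree-partitioned prediction; the `trees` stand-ins below are hypothetical, and each thread accumulates into a private buffer that is reduced at the end (one way to sidestep the synchronization question entirely):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the compiled trees: each maps a feature row to a score.
trees = [lambda row, w=w: w * row.sum() for w in np.linspace(0.1, 1.0, 100)]

def predict_tree_partitioned(input_data, n_threads=2):
    n_rows = len(input_data)
    chunk = len(trees) // n_threads  # assumes n_threads divides the tree count

    def worker(t):
        # Each thread covers all rows, but only its own slice of the trees,
        # accumulating into a private buffer so no synchronization is needed.
        partial = np.zeros(n_rows)
        for row_id in range(n_rows):
            for tree in trees[t * chunk:(t + 1) * chunk]:
                partial[row_id] += tree(input_data[row_id])
        return partial

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(worker, range(n_threads)))
    # Reduce the per-thread buffers into the global result.
    return np.sum(partials, axis=0)
```

Because every thread repeatedly executes the same small slice of trees over all rows, its working set of tree code stays hot, which is the cache effect described below.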
Ideally this would keep each thread's trees fully in the L1 instruction cache (L1i), resulting in super-linear speedups with enough cores instead of the (at best) linear speedups of the current scheme.
Benchmarking is needed to measure how large the overhead of the required atomic adds would be.
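For reference, a minimal sketch of the shared-buffer variant that those atomic adds would serve. In Python a lock stands in for the per-element atomic float adds the compiled code would need; the `trees` stand-ins are again hypothetical:

```python
import threading
import numpy as np

# Hypothetical stand-ins for the compiled trees.
trees = [lambda row, w=w: w * row.sum() for w in np.linspace(0.1, 1.0, 100)]

def predict_shared_result(input_data, n_threads=2):
    result = np.zeros(len(input_data))
    lock = threading.Lock()  # stand-in for atomic adds in compiled code
    chunk = len(trees) // n_threads  # assumes n_threads divides the tree count

    def worker(t):
        for row_id in range(len(input_data)):
            s = sum(tree(input_data[row_id])
                    for tree in trees[t * chunk:(t + 1) * chunk])
            with lock:  # this synchronization cost is what needs benchmarking
                result[row_id] += s

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result
```

Accumulating a whole row's tree slice locally before the guarded add, as done here, already amortizes the synchronization to one add per row per thread.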
Caveats:
n_threads would need to be specified during compile()?
The forest_root function would get a more complicated API, making it harder for users to implement their own runtime.