Right now, parallelization in lleaves is implemented in a straightforward way by partitioning the input data across threads. Instead, each thread should predict across the whole input data, but only across a subset of the trees.
Example (100 trees in the forest, 2 threads):

Thread 1:
    for row_id in range(len(input_data)):
        for tree in trees[0:50]:
            result[row_id] += tree(input_data[row_id])
    global_result += result

Thread 2:
    for row_id in range(len(input_data)):
        for tree in trees[50:100]:
            result[row_id] += tree(input_data[row_id])
    global_result += result
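The scheme above can be sketched in plain Python. This is not the lleaves implementation, just an illustration of tree-partitioned prediction; the `trees` stand-ins below are hypothetical, and each thread accumulates into a private buffer that is reduced at the end (one way to sidestep the synchronization question entirely):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the compiled trees: each maps a feature row to a score.
trees = [lambda row, w=w: w * row.sum() for w in np.linspace(0.1, 1.0, 100)]

def predict_tree_partitioned(input_data, n_threads=2):
    n_rows = len(input_data)
    chunk = len(trees) // n_threads  # assumes n_threads divides the tree count

    def worker(t):
        # Each thread covers all rows, but only its own slice of the trees,
        # accumulating into a private buffer so no synchronization is needed.
        partial = np.zeros(n_rows)
        for row_id in range(n_rows):
            for tree in trees[t * chunk:(t + 1) * chunk]:
                partial[row_id] += tree(input_data[row_id])
        return partial

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(worker, range(n_threads)))
    # Reduce the per-thread buffers into the global result.
    return np.sum(partials, axis=0)
```

Because every thread repeatedly executes the same small slice of trees over all rows, its working set of tree code stays hot, which is the cache effect described below.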
Ideally this would keep each thread's trees fully in the L1 instruction cache (L1i), resulting in super-linear speedups with enough cores instead of the (at best) linear speedups of the current scheme.
Benchmarking is needed to measure how large the overhead of the required atomic adds would be.
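For reference, a minimal sketch of the shared-buffer variant that those atomic adds would serve. In Python a lock stands in for the per-element atomic float adds the compiled code would need; the `trees` stand-ins are again hypothetical:

```python
import threading
import numpy as np

# Hypothetical stand-ins for the compiled trees.
trees = [lambda row, w=w: w * row.sum() for w in np.linspace(0.1, 1.0, 100)]

def predict_shared_result(input_data, n_threads=2):
    result = np.zeros(len(input_data))
    lock = threading.Lock()  # stand-in for atomic adds in compiled code
    chunk = len(trees) // n_threads  # assumes n_threads divides the tree count

    def worker(t):
        for row_id in range(len(input_data)):
            s = sum(tree(input_data[row_id])
                    for tree in trees[t * chunk:(t + 1) * chunk])
            with lock:  # this synchronization cost is what needs benchmarking
                result[row_id] += s

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result
```

Accumulating a whole row's tree slice locally before the guarded add, as done here, already amortizes the synchronization to one add per row per thread.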
Caveats:
n_threads would need to be specified during compile()?
The forest_root function would get a more complicated API, making it harder for users to implement their own runtime.