Open inkrement opened 1 year ago
Not quite sure if I understand the problem. You're running N trees over the same dataset D to get N predictions, correct? Lleaves already parallelizes inference over the data in a very simple way, see here: https://github.com/siboehm/lleaves/blob/master/lleaves/lleaves.py#L183. Do you want to additionally parallelize over the N models?
If the dataset is big enough relative to your number of CPU cores, the data parallelism should already stress your system enough that additional parallelization will be a minor benefit at best. If the dataset isn't big enough or you have tons of cores, I guess you could additionally parallelize over the models using plain Python multiprocessing, or use the low-level C API and write your own low-overhead parallelism.
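A minimal sketch of the "parallelize over the models" idea, in case it helps. It uses a thread pool from the standard library rather than multiprocessing, which only pays off under the assumption that each model's `predict` call releases the GIL while running native code. `StandInModel` and `predict_all` are hypothetical names for illustration and not part of lleaves; the stand-in only mimics an object with a `predict(data)` method.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInModel:
    """Hypothetical placeholder for a compiled model exposing predict(data)."""

    def __init__(self, weight):
        self.weight = weight

    def predict(self, data):
        # One prediction per row; a real model would evaluate its trees here.
        return [self.weight * sum(row) for row in data]


def predict_all(models, data, max_workers=None):
    """Run every model on the same data, one task per model.

    Results are returned in the same order as `models`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all tasks first so they can run concurrently.
        futures = [pool.submit(model.predict, data) for model in models]
        # Collect in submission order, not completion order.
        return [future.result() for future in futures]


models = [StandInModel(w) for w in (1.0, 2.0, 0.5)]
data = [[1.0, 2.0], [3.0, 4.0]]
predictions = predict_all(models, data)
```

If each model already spreads its own work over all cores, stacking a second layer of threads on top can oversubscribe the CPU, so it is probably worth capping `max_workers` or the per-model thread count and measuring both variants.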
Yes, N independent trees over the same dataset to get N predictions. It is interesting that lleaves automatically parallelizes inference. Although I use it on a 20+ core Intel CPU, I haven't seen CPU utilization above 120%. Maybe I should pass bigger batches (I have tried 15k-100k observations with 150 features each). I'll take a closer look. By the way, thanks for this fantastic piece of software!
That sounds like a big enough dataset that it should definitely parallelize well! Let me know what you find. I'd probably try:
I'll investigate it in more detail, but I can already rule out (1), since I am loading pre-compiled/cached models, and (3), since the data is already in memory.
I would like to run multiple regression models at once. They all use the same input. Is there a way to parallelize the inference? Right now I apply them sequentially. Thank you in advance!