suiji / Arborist

Scalable decision tree training and inference.

Running on HPC with implicit parallelization #33

Open pnsinha opened 6 years ago

pnsinha commented 6 years ago

I am impressed with the speed and memory usage of this package. Could we use it on an HPC cluster with more than one node? So far I have been successful with one node only.

suiji commented 6 years ago

Thank you; I'm glad you're enjoying the package. I have been waiting nearly three years for someone to request this feature.

The core implementation is factored in such a way that blocks of trees can be trained independently. The main things missing are the low-level calls to: i) Scatter the blocks among the nodes. ii) Gather the independently-trained trees.
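
As a rough R-level sketch of that scatter/gather idea (not an existing Arborist feature, and not the low-level layer described above): each worker could train its own block of trees with a separate Rborist() call, and the per-block forests could be gathered back and combined at prediction time. For regression, averaging the per-block predictions matches what a single forest of the combined size would give. Host names, block sizes, and the data objects below are placeholders, and the $yPred field of predict()'s return value is assumed.

```r
## Hedged sketch: scatter blocks of trees to workers, gather the forests.
## Assumes the standard Rborist(x, y, nTree = ...) interface; x, y, newX,
## and the cluster size are placeholders.
library(Rborist)
library(parallel)

nBlocks   <- 4      # one block of trees per compute node
treeBlock <- 125    # 4 x 125 = 500 trees overall

cl <- makeCluster(nBlocks)              # in practice, a cluster spanning nodes
clusterEvalQ(cl, library(Rborist))
clusterExport(cl, c("x", "y", "treeBlock"))

## Scatter: each worker independently trains its own block of trees.
forests <- parLapply(cl, seq_len(nBlocks), function(i) {
  Rborist(x, y, nTree = treeBlock)
})
stopCluster(cl)

## Gather: combine at prediction time by averaging per-block predictions
## (regression; classification would need vote aggregation instead).
predBlocks <- lapply(forests, function(f) predict(f, newX)$yPred)
yHat <- Reduce(`+`, predBlocks) / length(predBlocks)
```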

I would prefer to do this in an MPI setting, but some users are really big on Spark. What clustering interface(s) are you interested in?

suiji commented 6 years ago

MPI support has been on the TODO list for some time. While it is fairly straightforward to implement, other features have taken precedence. Without feedback from potential adopters, there is no motivation to accelerate its introduction or to leave this thread open.

pnsinha commented 6 years ago

Sorry for the delay in my reply. I am interested in using it on an XSEDE system like TACC/STAMPEDE, where each node typically has 12 cores and 24 GB of memory. Currently, Rborist works perfectly on one node using all cores. In my application in particular, I am predicting on a large amount of new data, so prediction is slow.

ck37 commented 6 years ago

If you want multi-node prediction you can do that on your own without much work or any code changes to Arborist, using, e.g., the doSNOW or future package. Divide up your data, and either package will parallelize across those nodes and let each worker run a sequential Arborist. Or you could run one multicore Arborist on each node.
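
As a concrete illustration of that approach (a sketch under assumptions, not code from this thread): the future framework can split the new data into row chunks, have each worker call predict() on its chunk against an already-trained model, and reassemble the results. The worker host names are placeholders, fit and newdata stand in for your model and prediction set, and the $yPred field of Rborist's predict() value is assumed; a doSNOW/foreach version would look much the same.

```r
## Hedged sketch: multi-node prediction by splitting rows across workers
## with the 'future' framework. 'fit' is an Rborist model trained elsewhere,
## 'newdata' is the large prediction set, host names are placeholders.
library(future.apply)

plan(cluster, workers = c("node01", "node02", "node03"))

## Split row indices into one contiguous chunk per worker.
idxChunks <- split(seq_len(nrow(newdata)),
                   cut(seq_len(nrow(newdata)), nbrOfWorkers(), labels = FALSE))

## Each worker runs a sequential Rborist predict on its chunk.
predBlocks <- future_lapply(idxChunks, function(idx) {
  predict(fit, newdata[idx, , drop = FALSE])$yPred   # yPred field assumed
})

## Reassemble predictions in the original row order.
yHat <- unlist(predBlocks, use.names = FALSE)
```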

suiji commented 6 years ago

For XSEDE access, an under-the-hood MPI implementation should suffice.

Development has recently focused almost exclusively on training, so it's nice to learn of a need for faster prediction. In fact, it's probably a simpler matter to open up prediction to multiple nodes than to do the same for training. The main loop of prediction operates over moderately-sized blocks of contiguous rows, with width equal to the number of trees. An additional level of strip mining could assign the blocks to compute nodes.

@ck37: Your comments appeared as I was composing this. Thank you for chiming in. Can doSNOW and the like compute summary statistics over the separately-predicted blocks, or would it be necessary to re-enter Rborist for final processing?

suiji commented 6 years ago

@ck37: Yours seems like a much better idea than relying on Core code to maintain an MPI layer. For best results, the main training and prediction loops can be refactored for blocked invocation. A Summarize() method could be introduced to operate over the separately-obtained predictions.
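
No such Summarize() exists yet; purely as a sketch of what an R-level combination step might look like (the function name and return fields below are hypothetical), it could take the separately-obtained block predictions and the observed responses and report the usual validation summaries.

```r
## Hypothetical sketch only: a Summarize()-style helper that stitches
## per-block regression predictions back together and reports simple
## validation statistics. Nothing here is part of the current Rborist API.
summarizeBlocks <- function(predBlocks, yTest) {
  yPred <- unlist(predBlocks, use.names = FALSE)   # blocks assumed row-ordered
  stopifnot(length(yPred) == length(yTest))
  sse <- sum((yPred - yTest)^2)
  list(yPred = yPred,
       mse   = sse / length(yTest),
       rsq   = 1 - sse / sum((yTest - mean(yTest))^2))
}
```

Applied to the predBlocks list from the prediction sketch above, summarizeBlocks(predBlocks, yTest) would stand in for the final processing step mentioned earlier.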

suiji commented 5 years ago

Additional hooks into both training and prediction are planned for the 0.2.x series. This should permit both operations to execute in a hierarchically parallel fashion, with MPI(-like) dispatch to multiple compute nodes and OpenMP threads parallelizing locally on individual nodes.
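
To make the two-level idea concrete (an R-level approximation under assumptions, not the planned hooks themselves): an outer layer dispatches one block of trees per compute node, while inside each worker Rborist's own OpenMP threading uses that node's cores. Host names, block size, and the data objects are placeholders.

```r
## Hedged sketch of hierarchical parallelism at the R layer: node-level
## dispatch via 'future', core-level parallelism via Rborist's own OpenMP
## threads on each node. x, y, host names, and block size are placeholders.
library(future.apply)
library(Rborist)

plan(cluster, workers = c("node01", "node02", "node03", "node04"))

treeBlock <- 125   # trees per node; four nodes give 500 trees overall

forests <- future_lapply(seq_len(nbrOfWorkers()), function(i) {
  Rborist(x, y, nTree = treeBlock)   # OpenMP parallelism local to the node
}, future.seed = TRUE)
```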