suiji / Arborist

Scalable decision tree training and inference.

GPU documentation/status #54

Open saraemoore opened 3 years ago

saraemoore commented 3 years ago

Is there documentation or an update available on a GPU-accelerated Rborist (previously referred to elsewhere as "Curborist")? The current Rborist docs state that it is "tuned for multicore and GPU hardware," but I wasn't able to find any details on usage options for GPUs specifically. Thanks in advance for any details you can share!

suiji commented 3 years ago

"Curborist" stood for CUDA-enabled Rborist. Repartitioning of the observations was performed on the GPU using a stable partition. The multicore implementation was at least as fast, however, on data sets with fewer than roughly ten thousand observations. The most obvious bottlenecks arose from data movement and from the fact that splitting was still performed on the CPU. It is fairly clear how to perform splitting on the GPU using parallel-prefix, even with the additional complication posed by the Random Forest algorithm's variable sampling. That work was never completed, however, as there remained other, more easily implemented opportunities to improve performance on the CPU alone.
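To make the repartitioning step concrete: after a node splits, each observation index is routed to the left or right child while preserving the pre-established ordering within each side, which is exactly what a stable partition provides. The sketch below is a minimal host-side illustration under assumed names (`repartition`, `obsIdx`), not Arborist's actual code:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch only: repartition one node's observation indices after
// a split on a single numeric predictor. std::stable_partition moves indices
// satisfying the split predicate ahead of the rest while preserving each
// side's original relative order -- the property the GPU version relied on.
std::vector<int> repartition(const std::vector<int>& obsIdx,
                             const std::vector<double>& predictor,
                             double splitValue) {
  std::vector<int> out(obsIdx);
  std::stable_partition(out.begin(), out.end(),
                        [&](int i) { return predictor[i] <= splitValue; });
  return out;
}
```

On a GPU the same operation maps naturally onto a parallel-prefix (scan) formulation, since each index's destination slot can be computed from a prefix sum over the split predicate.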

Having dealt with the difficulties of tracking CUDA across multiple platforms and releases, however, we now believe a more maintainable approach will be to use the newer features of OpenMP for GPU parallelization, particularly those enabled by versions 5.0 and 5.1. The current plan is to offer a fat binary which will look for a GPU and, upon finding one or more, invoke both repartitioning and splitting on the coprocessor, when it makes sense to do so. Rborist version 0-3.0 has been extensively refactored to make this possible. Even given this reorganization, though, a truly GPU-capable implementation will not be available before 0-4.0, not least because compilers supporting the new standards do not yet appear to be generally available; last I checked, only Cray and AMD offer support. That said, we envision that the only intervention required of the user will be to set the "enableCoprocessor" option to TRUE.
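The fat-binary idea rests on a property of OpenMP target offload: the same region runs on a device when one is present and falls back to the host otherwise, so no separate CUDA build is needed. A minimal sketch in that style follows; the function name is invented and this is not Arborist's planned code:

```cpp
#include <vector>

// Hedged illustration of OpenMP target offload: with an offload-capable
// compiler and a GPU present, the loop runs on the device; otherwise the
// pragma is honored on the host (or ignored entirely without -fopenmp),
// giving a single binary that works either way.
double sumOnDevice(const std::vector<double>& x) {
  const double* p = x.data();
  const int n = static_cast<int>(x.size());
  double acc = 0.0;
  // An if() clause on the target construct could gate offload by problem
  // size, echoing the observation that small data sets favor the CPU path.
  #pragma omp target teams distribute parallel for reduction(+:acc) \
      map(to: p[0:n])
  for (int i = 0; i < n; ++i)
    acc += p[i];
  return acc;
}
```

Repartitioning and splitting would replace the trivial reduction here, but the offload-with-fallback structure is the same.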

For completeness, I should also point out that there was a proprietary CUDA version developed seven or eight years ago. This was specialized for low-categoricity classification, especially genomic work. It scaled quite nicely with predictor count, with 50x speedup over a bespoke multicore equivalent. The algorithm did not scale beyond roughly one thousand observations, though, and the approach has been abandoned in the open-source versions to follow.

saraemoore commented 3 years ago

Thanks very much for the very thorough response! Kudos for the preparatory work to make this feature possible. I'll keep an eye out for it in future releases.