rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] RandomForest binning vs sklearn RandomForest #5511

Closed adam2392 closed 1 year ago

adam2392 commented 1 year ago

What is your question?

Hi, in the documentation https://docs.rapids.ai/api/cuml/stable/api/#regression-and-classification, it states that cuML's RF uses binning by default, so that splits are evaluated on bin indices (integers) rather than on the actual feature values.

Is there some documentation, experiments, or paper discussing the tradeoffs? Is it faster? When is it more accurate, and why? How does this compare with the traditional RF approach taken by sklearn?
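
For concreteness, here is roughly the comparison I have in mind (a minimal sketch, assuming a CUDA-capable environment with cuml installed; `n_bins` is the documented histogram-bin parameter in cuML's RF):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as SkRF
from cuml.ensemble import RandomForestClassifier as CuRF

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int32)

# sklearn: exact splits, every unique feature value is a candidate threshold.
sk_rf = SkRF(n_estimators=100, max_depth=16, random_state=0).fit(X, y)

# cuML: histogram-based splits, candidate thresholds come from n_bins bins.
cu_rf = CuRF(n_estimators=100, max_depth=16, n_bins=128, random_state=0).fit(X, y)

print("sklearn train accuracy:", sk_rf.score(X, y))
print("cuML train accuracy:   ", cu_rf.score(X, y))
```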

RAMitchell commented 1 year ago

Hi, this binning approach for tree models has been popularised by LightGBM:

Ke, Guolin, et al. "Lightgbm: A highly efficient gradient boosting decision tree." Advances in neural information processing systems 30 (2017).

It appears earlier in works such as:

Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal, and Jennifer Paykin. 2011. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th international conference on World wide web (WWW '11). Association for Computing Machinery, New York, NY, USA, 387–396. https://doi.org/10.1145/1963405.1963461

The benefits of quantising the candidate split points are significantly improved speed, reduced memory use, and an easier distributed implementation.
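
As a rough illustration of the idea (my own numpy sketch, not cuML's actual implementation): instead of sorting and scanning every unique value of a feature, the learner only evaluates about `n_bins` thresholds taken from the feature's quantiles, independent of the number of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1_000_000)

# Exact approach (sklearn-style): candidate thresholds are all unique values.
exact_candidates = np.unique(feature)

# Histogram approach (LightGBM/cuML-style): candidates are quantile bin edges.
n_bins = 128
bin_edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])

# Each sample is then represented by its small-integer bin index, which is
# cheap to store and to aggregate into per-bin split statistics.
bin_index = np.searchsorted(bin_edges, feature).astype(np.uint8)

print(f"exact candidate thresholds:  {exact_candidates.size:,}")  # ~1,000,000
print(f"binned candidate thresholds: {bin_edges.size:,}")         # 127
```

The tradeoff is that split thresholds are restricted to the bin edges, so very fine-grained splits can be lost; in practice a modest number of bins usually costs little accuracy while making the split search much cheaper.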

csadorf commented 1 year ago

@RAMitchell Thanks a lot for providing the background information!

@adam2392 I hope that answers your question, but please let us know if you have more. I'll close the issue for now.