imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

K-Histogram algorithm in ranger? #258

Open Mamba413 opened 6 years ago

Mamba413 commented 6 years ago

Would you like to add the K-histogram algorithm used in LightGBM (https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#comparison-experiment) or XGBoost? I think it is an efficient algorithm that balances bias and variance.

mnwright commented 6 years ago

From the link above

bucketing continuous feature(attribute) values into discrete bins

So, the idea is to reduce the number of candidate split points to evaluate, as in randomised splitting. However, the candidate split points are derived from histograms instead of being sampled randomly. Are the bins of equal width, or should they contain the same number of training observations?
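For illustration, a minimal C++ sketch (not ranger, LightGBM, or XGBoost code; function names are made up) of the two binning choices asked about above, each producing k-1 candidate cut points for a continuous feature:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Equal-width bins: k-1 cut points spaced evenly between min and max.
// Assumes x is non-empty and k >= 2.
std::vector<double> equal_width_cuts(const std::vector<double>& x, std::size_t k) {
  auto [mn, mx] = std::minmax_element(x.begin(), x.end());
  double width = (*mx - *mn) / static_cast<double>(k);
  std::vector<double> cuts;
  for (std::size_t i = 1; i < k; ++i) {
    cuts.push_back(*mn + width * static_cast<double>(i));
  }
  return cuts;
}

// Equal-frequency bins: cut points at empirical quantiles, so each bin
// holds roughly the same number of training observations.
std::vector<double> equal_frequency_cuts(std::vector<double> x, std::size_t k) {
  std::sort(x.begin(), x.end());
  std::vector<double> cuts;
  for (std::size_t i = 1; i < k; ++i) {
    cuts.push_back(x[i * x.size() / k]);
  }
  return cuts;
}
```

Either way, only the k-1 cut points are evaluated as splits instead of every distinct observed value.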

Mamba413 commented 6 years ago

Generally, these methods try to keep the same number of observations in each bin. There are also more advanced methods that try to preserve the distribution of the variable, such as the Weighted Quantile Sketch in the paper "XGBoost: A Scalable Tree Boosting System" (http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf).
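A simplified C++ sketch of the weighted idea (an assumption for illustration only, not the actual Weighted Quantile Sketch, which is an approximate streaming algorithm): each observation carries a weight (in XGBoost, the second-order gradient), and cut points are placed so every bin holds roughly the same total weight rather than the same count.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Cut points at weighted quantiles of a feature.
// xw holds (value, weight) pairs; assumes xw is non-empty and k >= 2.
std::vector<double> weighted_quantile_cuts(std::vector<std::pair<double, double>> xw,
                                           std::size_t k) {
  std::sort(xw.begin(), xw.end());  // sort by feature value
  double total = 0.0;
  for (const auto& p : xw) total += p.second;

  const double target = total / static_cast<double>(k);  // weight per bin
  std::vector<double> cuts;
  double acc = 0.0;
  std::size_t next = 1;
  for (const auto& [value, weight] : xw) {
    acc += weight;
    // Place a cut each time the cumulative weight crosses the next quantile.
    if (next < k && acc >= target * static_cast<double>(next)) {
      cuts.push_back(value);
      ++next;
    }
  }
  return cuts;
}
```

With all weights equal to one, this reduces to the equal-frequency binning described above.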