microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[Feature] Random Forest mode #47

Closed Laurae2 closed 7 years ago

Laurae2 commented 7 years ago

Just wondering, are there any plans to add the modifications needed to use LightGBM in a Random Forest mode? It would be really interesting to see how LightGBM fares in a (potential) Random Forest mode, both in terms of speed and performance vs xgboost, H2O, Python scikit-learn, and R (LightGBM could be much faster than any of them?).

guolinke commented 7 years ago

Most of LightGBM's optimizations are in the decision tree learning, so I think it will still be fast in RF mode. We are interested in an RF implementation, but I would like to call on the community to contribute it first.

gugatr0n1c commented 7 years ago

Voting for this as well; options like these would be great:

Similarly as in xgboost:

Other parameters can be used as well.

guolinke commented 7 years ago

@gugatr0n1c, good suggestion. I think a simpler solution is to inherit RF from GBDT, override the boosting method with an empty one, and enable bagging. Also, I would like to know: what is the purpose of "calculate out_of_bag prediction"?
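
A minimal sketch of that idea, assuming hypothetical class and method names (not LightGBM's actual internals): RF reuses GBDT's bagging and tree learner, but the boosting step is overridden to do nothing, so trees are fit independently and averaged at prediction time.

    class GBDT {
     public:
      virtual ~GBDT() = default;
      void TrainOneIter() {
        Bagging();        // row subsampling controlled by bagging_fraction
        BuildOneTree();   // histogram-based tree learner
        Boosting();       // add the (shrunk) tree output to the train scores
      }
     protected:
      virtual void Bagging() { /* sample rows for this iteration */ }
      virtual void BuildOneTree() { /* fit one tree on current targets */ }
      virtual void Boosting() { /* update train scores / gradients */ }
    };

    // RF inherits everything and simply disables the boosting step.
    class RF : public GBDT {
     protected:
      void Boosting() override {}  // no shrinkage, no score update
    };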

gugatr0n1c commented 7 years ago

Out-of-bag prediction comes from practical experience. Many tasks are mostly about feature engineering, and RF with OOB prediction is the fastest way to test progress.

So my typical use case is as follows: 1] find out through hyperparameter tuning that the best params for boosting are e.g. learning_rate = 0.002 with 25k iterations. On a 1M-row dataset with 10-fold cross-validation this runs for 2 days, even with your library. This is actually the final step of building the model, because it gives me the best accuracy.

2] But before that, when I am working on feature engineering, I need to try several features. This can be done easily with an RF of just 250 trees. With out-of-bag prediction I don't even need to do cross-validation.

I have found several times that a better feature for RF is also (very likely) a better feature for boosting, because both use trees. Also, OOB gives me accuracy over the whole dataset just like that, without needing 10-fold CV.

So this is my motivation; I believe this is not an unusual way to build a great model with strong features. Thanks for the consideration.

wxchan commented 7 years ago

Is there a random forest implementation in xgboost?

I tried some experiments with random forest, but the performance is not good.

I implemented it in the following way; what did I miss?

  1. inherit from gbdt
  2. only boost once, at iter = 0
  3. no shrinkage and no updating of the training score during training
  4. bagging with replacement from n samples to n samples (not exact, but correct in expectation):

       // p = bagging_fraction; extra copies of a kept sample are pushed so the
       // expected number of copies matches sampling with replacement
       double p = bagging_fraction;
       for (int i = 0; i < num_data; ++i) {
         if (Random.nextDouble() < p) {
           bag_data_indices.push_back(i);
           while (Random.nextDouble() < 1 - p) {
             bag_data_indices.push_back(i);
           }
         } else {
           out_of_bag_data_indices.push_back(i);
         }
       }

  5. resample used_feature_indices every time before finding a split point

guolinke commented 7 years ago

@wxchan I think the accuracy of RF cannot beat GBDT in most applications. I think you can compare it with the RF in xgboost.

wxchan commented 7 years ago

How do I use RF in xgboost? I didn't see it in the documentation.

guolinke commented 7 years ago

@wxchan https://github.com/dmlc/xgboost/blob/b94fcab4dc90cfd6c0a80ac6dc3ea0ad154543a6/src/gbm/gbtree.cc#L35

guolinke commented 7 years ago

@wxchan, I think using num_parallel_tree may be a better way. That way, we can also support a random gbdt-forest.

wxchan commented 7 years ago

OK, they really didn't mention it in the documentation. It trains num_tree * num_parallel_tree trees. I guess it does bagging as usual and doesn't resample the used features before every split. And the OOB score may not be useful here.

According to Tianqi Chen's answer on Quora, I guess he might consider random forest to be no longer needed; he only added the advantages of random forest to improve gbdt.

guolinke commented 7 years ago

@wxchan, OK, I see. One more question: do these GBDTs train independently, or are they merged (e.g. by averaging the outputs of the trees) after each iteration before boosting continues?

guolinke commented 7 years ago

@wxchan, I think you should take care with the histogram subtraction logic. If you sample features randomly at every split, we cannot cache the histograms for some features.
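
For context, a rough sketch of the histogram subtraction being referred to (simplified types, not LightGBM's actual code): the larger child's histogram is obtained by subtracting the smaller child's histogram from the parent's, bin by bin, so only one child has to be built from the data. The shortcut assumes the parent's cached histogram covers every feature the children will use, which is why re-sampling features at every split breaks the caching.

    #include <vector>

    // Simplified histogram entry for one bin of one feature.
    struct BinEntry {
      double sum_gradients = 0.0;
      double sum_hessians = 0.0;
      int count = 0;
    };
    using FeatureHistogram = std::vector<BinEntry>;  // one entry per bin

    // larger_child = parent - smaller_child, for one feature's histogram.
    FeatureHistogram Subtract(const FeatureHistogram& parent,
                              const FeatureHistogram& smaller_child) {
      FeatureHistogram larger_child(parent.size());
      for (size_t bin = 0; bin < parent.size(); ++bin) {
        larger_child[bin].sum_gradients =
            parent[bin].sum_gradients - smaller_child[bin].sum_gradients;
        larger_child[bin].sum_hessians =
            parent[bin].sum_hessians - smaller_child[bin].sum_hessians;
        larger_child[bin].count = parent[bin].count - smaller_child[bin].count;
      }
      return larger_child;
    }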

wxchan commented 7 years ago

It simply trains n trees and updates the score_updater once, so we just need to add a loop outside the num_class loop. To be honest, I don't think making the logic too complicated is worth it. Actually, we already have a model between gbdt and rf, which is dart.
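
A loose illustration of that loop structure (hypothetical names, not xgboost's or LightGBM's actual code): per iteration, an extra loop wraps the per-class tree training, and the training scores are updated only once afterwards.

    void TrainOneIteration(int num_parallel_tree, int num_class) {
      for (int k = 0; k < num_parallel_tree; ++k) {   // the added outer loop
        for (int c = 0; c < num_class; ++c) {
          // TrainOneTree(c): fit one tree for class c on the current gradients
        }
      }
      // UpdateScore(): the score_updater is touched once per iteration
    }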

gugatr0n1c commented 7 years ago

I tried n_tree in xgboost several times and performance was not better, so if the argument is to use it inside boosting as well, it is not worth it; maybe inheriting it into a second class will be better...

gugatr0n1c commented 7 years ago

The core use case for RF, when we already have boosting methods, is speed. When you are building an app where you need the fastest prediction you can get (e.g. real-time motion) and RF accuracy is fine, then we still go for RF, because sometimes speed matters more than accuracy. Just another view on RF.

guolinke commented 7 years ago

@wxchan OK, so a simple RF is fine then. BTW, maybe it is better not to enable subsampling by leaf.

Laurae2 commented 7 years ago

@wxchan: the performance will not be good in any case (unless you change the evaluation metric), because the leaf updates do not have the same freedom as in real Random Forest implementations (you are limited by grad/hess in LightGBM/xgboost). If you want to compare with Random Forest implementations, using AUC for classification is recommended (so the comparison is scale-invariant); for regression, use the Spearman correlation coefficient.

As for the prediction outputs (which will be on the wrong scale), a simple calibration pass done by the user should be enough (preferably isotonic regression, but this should be left to the user to perform manually, because in many cases the user prefers the raw outputs rather than the calibrated ones).
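
For what it's worth, a minimal sketch of the isotonic (pool-adjacent-violators) fit such a calibration pass would use, assuming the labels have already been sorted by the raw prediction (hypothetical helper, not part of LightGBM; mapping new predictions onto the fitted step function is left out):

    #include <vector>

    // Fit a non-decreasing sequence to y (labels ordered by raw prediction),
    // returning one calibrated value per input point.
    std::vector<double> IsotonicFit(const std::vector<double>& y) {
      std::vector<double> level(y.size());   // fitted value of each block
      std::vector<double> weight(y.size());  // number of points pooled so far
      std::vector<size_t> size(y.size());    // points covered by each block
      size_t blocks = 0;
      for (size_t i = 0; i < y.size(); ++i) {
        // start a new block holding just y[i]
        level[blocks] = y[i];
        weight[blocks] = 1.0;
        size[blocks] = 1;
        // pool adjacent blocks while the non-decreasing constraint is violated
        while (blocks > 0 && level[blocks - 1] > level[blocks]) {
          double w = weight[blocks - 1] + weight[blocks];
          level[blocks - 1] = (weight[blocks - 1] * level[blocks - 1] +
                               weight[blocks] * level[blocks]) / w;
          weight[blocks - 1] = w;
          size[blocks - 1] += size[blocks];
          --blocks;
        }
        ++blocks;
      }
      // expand the pooled blocks back to one fitted value per input point
      std::vector<double> fitted;
      fitted.reserve(y.size());
      for (size_t b = 0; b < blocks; ++b) {
        fitted.insert(fitted.end(), size[b], level[b]);
      }
      return fitted;
    }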

To compare with real RFs, you have to use a sampling of 0.632 (bootstrap).

ekerazha commented 7 years ago

> To compare with real RFs, you have to use a sampling of 0.632 (bootstrap).

Note that real RFs use sampling with replacement. If you sample 0.63 without replacement, you can have the same number of unique samples, but that's not the same thing.
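
For reference, the 0.632 comes from sampling with replacement itself: drawing n rows with replacement from n, the probability that a given row appears at least once is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows. So a bootstrap sample contains about 63.2% unique rows in expectation; sampling a 0.632 fraction without replacement matches that unique count but, as noted above, not the duplicate counts.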