Closed: tclv closed this issue 6 years ago
This might be helpful: https://github.com/Microsoft/LightGBM/issues/47#issuecomment-266725875
I personally like the rf mode of lightgbm too. It is fast enough to use it for active learning on big data sets.
Hi Goraj, could you elaborate more on what you are referring to? I have tried fiddling with Isotonic regression, but it seems rather patchworky, and the performance is also lacking.
@Tclv I currently only use the rf mode for binary classifications. I just remembered reading about this and thought it might be related to your issue. Maybe @Laurae2 can help.
With LightGBM in Random Forest mode, the previously built trees do not matter when a new tree is built (it just piles up independent trees and averages their outputs to predict).
There is no convergence possible with Random Forest, because it is similar to a 1-iteration Gradient Boosting.
Early stopping does not matter (it should never be used) in Random Forest; it makes no sense, as it's a random process, unlike Gradient Boosting, which is an optimization process.
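For reference, a minimal sketch of how Random Forest mode is typically enabled in LightGBM; the data and parameter values here are illustrative assumptions, not the ones used in this thread:

```python
import lightgbm as lgb
import numpy as np

# Illustrative synthetic data; not the dataset discussed in this issue.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.poisson(lam=np.exp(0.3 * X[:, 0]))

# Random Forest mode: boosting="rf" requires bagging to be enabled.
rf_params = {
    "boosting": "rf",
    "objective": "poisson",
    "bagging_freq": 1,          # perform bagging at every iteration
    "bagging_fraction": 0.8,    # sample 80% of rows per tree
    "feature_fraction": 0.8,    # sample 80% of features per tree
    "num_leaves": 1024,
    "verbose": -1,
}

train_set = lgb.Dataset(X, label=y)
# Each tree is built independently of the previous ones; predictions are averaged.
rf_model = lgb.train(rf_params, train_set, num_boost_round=100)
```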
@Laurae Thanks for answering!
> Early stopping does not matter (it should never be used) in Random Forest; it makes no sense, as it's a random process, unlike Gradient Boosting, which is an optimization process.
Right, that makes sense.
> There is no convergence possible with Random Forest, because it is similar to a 1-iteration Gradient Boosting.
I am not sure if I am interpreting this right, but the results from 1 iteration of `gbrt` and `rf` should be similar? This is what I find odd about the current behavior of `rf` with LightGBM, and maybe my example is not concise enough. If I set `n_estimators` to 1 for both `rf` and `gbrt`, the results of `rf` are quite off compared to `gbrt` (0.6936 vs 0.2611). It seems odd, as 0.6936 is quite a steep overprediction, and I have observed the same over different datasets when using Poisson regression in conjunction with `rf`. I was wondering what is causing this discrepancy?
@Tclv you can try to set `boost_from_average=false` for both gbdt and rf. It is disabled in rf mode, while it is enabled by default in gbdt mode.
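For example, a minimal sketch of what a like-for-like comparison might look like; the surrounding parameter values are illustrative assumptions, not taken from this thread:

```python
# Disable boost_from_average in both modes so the comparison starts
# from the same baseline (it is already disabled in rf mode).
common = {"objective": "poisson", "boost_from_average": False, "verbose": -1}

gbdt_params = {**common, "boosting": "gbdt"}
rf_params = {**common, "boosting": "rf",
             "bagging_freq": 1, "bagging_fraction": 0.9,
             "feature_fraction": 0.9}
```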
@guolinke The averages for `gbrt` are 0.963 after 1 iteration, slowly converging down to 0.269 after 80 iterations when `boost_from_average=false`. For `rf` the averages stay at 0.694 regardless of `boost_from_average` and the number of iterations.
@Tclv the `learning_rate` is fixed to 1 in RF mode, so you should set `learning_rate` to 1 in GBDT for the one-iteration comparison.
One suggestion for RF: you can let each tree be more "over-fitting" by using a larger `num_leaves` and `max_depth` and a smaller `min_data_in_leaf`. And `bagging_fraction=0.01` seems too small.
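As an illustration, a minimal sketch of an RF configuration along those lines; the exact values are assumptions for the example, not a recommendation from this thread:

```python
rf_params = {
    "boosting": "rf",
    "objective": "poisson",
    # Let each individual tree over-fit more, as suggested above.
    "num_leaves": 4096,
    "max_depth": -1,           # no depth limit
    "min_data_in_leaf": 1,
    # Keep bagging enabled, but with a much larger fraction than 0.01.
    "bagging_freq": 1,
    "bagging_fraction": 0.8,
    "feature_fraction": 0.9,
    "verbose": -1,
}
```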
@guolinke
```python
params = {
    'task': 'train',
    'boosting_type': 'gbrt',
    'objective': 'poisson',
    'metric': {'poisson'},
    'num_leaves': 2**10,
    'learning_rate': 1,
    'verbose': 0,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,
    'n_estimators': 1,
    'boost_from_average': False,
}
```
With these parameters (and varying `boost_from_average`), I get the following results for `pd.Series(gbm.predict(X[:train])).mean()`:

| | `boost_from_average=True` | `boost_from_average=False` |
|---|---|---|
| rf | 0.69848 | 0.69848 |
| gbrt | 0.29658 | 0.69848 |
Increasing `n_estimators` to 40:

| | `boost_from_average=True` | `boost_from_average=False` |
|---|---|---|
| rf | 0.69815 | 0.69815 |
| gbrt | 0.26082 | 0.26067 |
@Tclv The results look correct and expected now.
> I am not sure if I am interpreting this right, but the results from 1 iteration of `gbrt` and `rf` should be similar?
Yes.
> It seems odd, as 0.6936 is quite a steep overprediction
Random Forest usually requires post-calibration of predictions (it is not reasonable to expect the mean of Random Forest predictions to be near the mean of the real distribution, especially with non-standard loss functions such as the Poisson loss).
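As a side note, post-calibration can be as simple as rescaling predictions to the training-label mean, or fitting an isotonic regression on a held-out set (the isotonic approach was mentioned earlier in the thread). A minimal sketch, assuming predictions and labels are plain NumPy arrays; the helper names are hypothetical and sklearn's `IsotonicRegression` is used only for illustration:

```python
from sklearn.isotonic import IsotonicRegression

def mean_rescale(pred, train_label_mean):
    """Scale predictions so their mean matches the training-label mean."""
    return pred * (train_label_mean / pred.mean())

def isotonic_calibrate(pred_holdout, label_holdout, pred_new):
    """Fit a monotone mapping pred -> label on a held-out set, apply it to new predictions."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(pred_holdout, label_holdout)
    return iso.predict(pred_new)
```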
I've been working with the Random Forest algorithm in LightGBM over the past day and I've run into some unexpected behavior. Gamma regression seemed unstable, but linearly scaling the labels to a smaller range seemed to solve this instability. This was documented in a previous issue #1320 (and potentially fixed in one of the release candidates #1322 by setting max_delta, although this did not quite work for me; scaling did, however).
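For context, the label scaling referred to here is just a linear transformation of the target before training and the inverse transformation of the predictions afterwards; a minimal sketch, with the scale factor as an illustrative assumption:

```python
import numpy as np

SCALE = 100.0  # illustrative factor chosen so scaled labels fall in a small range

def scale_labels(y, scale=SCALE):
    """Linearly shrink labels before training."""
    return np.asarray(y) / scale

def unscale_predictions(pred, scale=SCALE):
    """Map model output back to the original label range."""
    return np.asarray(pred) * scale
```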
After that I worked on Poisson regression, but I cannot seem to tune lightGBM in such a way that it produces sane output when using Random Forest. The core of the problem seems that it sets a very poor first tree (which seems to grossly overestimate the average) and fails to improve on this. Switching to the GBRT algorithm immediately solves the problem, but I am still interested if Random Forest can be used on my dataset.
Below is a piece of code that sort of mimics the problem I have on my dataset.
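(The snippet itself did not survive in this copy of the thread. Purely as an illustrative stand-in, with synthetic data and hypothetical parameter values rather than the author's actual reproduction, it might look roughly like this:)

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, train = 20000, 15000
X = rng.normal(size=(n, 10))
y = rng.poisson(lam=np.exp(0.3 * X[:, 0] - 1.5))  # small positive Poisson rate

params = {
    'objective': 'poisson',
    'metric': 'poisson',
    'boosting_type': 'rf',        # switch to 'gbrt' to compare the two modes
    'num_leaves': 2 ** 10,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,
    'verbose': -1,
}

train_set = lgb.Dataset(X[:train], label=y[:train])
valid_set = lgb.Dataset(X[train:], label=y[train:], reference=train_set)
gbm = lgb.train(params, train_set, num_boost_round=100,
                valid_sets=[valid_set],
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

# The "last line" referred to below: the mean of the model's predictions.
print(pd.Series(gbm.predict(X[:train])).mean())
```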
The mean of the last line varies significantly between `rf` and `gbrt`. For `rf` it is 0.684; for `gbrt` it is 0.257. Compared to the actual average of 0.266, `gbrt` seems to converge a lot better. `gbrt` also seems to be able to train longer before early stopping kicks in (20 vs 2 iterations).

Is this intended behavior for Poisson regression with Random Forest? The performance difference between `gbrt` and `rf` seems too high.