microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Random forest Poisson Regression #1431

Closed · tclv closed this issue 6 years ago

tclv commented 6 years ago

I've been working with the Random Forest algorithm in LightGBM over the past day and I've run into some unexpected behavior. Gamma regression seemed unstable, but linearly scaling the labels to a smaller range seemed to resolve the instability. This was documented in a previous issue #1320 (and potentially fixed in one of the release candidates, #1322, by setting max_delta_step, although that did not quite work for me; scaling did, however).
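
For reference, a minimal sketch of that scaling workaround (the scale factor, the gamma objective, and the X_train/y_train/X_test names are illustrative assumptions, not taken from the original issues):

import lightgbm as lgb

# Assumed workaround: fit on labels shrunk to a smaller range and undo the
# scaling on the predictions afterwards.
scale = 1000.0
gamma_model = lgb.LGBMRegressor(objective='gamma')
gamma_model.fit(X_train, y_train / scale)
preds = gamma_model.predict(X_test) * scale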

After that I worked on Poisson regression, but I cannot seem to tune LightGBM in such a way that it produces sane output when using Random Forest. The core of the problem seems to be that it builds a very poor first tree (which grossly overestimates the average) and then fails to improve on it. Switching to the GBRT algorithm immediately solves the problem, but I am still interested in whether Random Forest can be used on my dataset.

Below is a piece of code that sort of mimics the problem I have on my dataset.

import pandas as pd
import numpy as np

np.random.seed(42)
features = 10     # informative features
N = 20000
noise = 0.05
dummies = 10      # uninformative features
train = 15000     # train/validation split point

X1 = pd.DataFrame(np.random.normal(size=(N, features)))
X2 = pd.DataFrame(np.random.normal(size=(N, dummies)))

X = pd.concat([X1, X2], axis=1)
# Poisson counts driven by the informative features, plus a little Gaussian
# noise, clipped at zero
Y = pd.Series(np.random.poisson(X1.mean(axis=1).apply(lambda x: max(x, 0.2))).astype('float'))
Y += np.random.normal(size=N) * noise
Y = Y.apply(lambda x: max(0, x))

Y.mean()  # 0.2669

import lightgbm as lgb

params = {
    'task': 'train',
    'boosting_type': 'rf', #'gbrt'
    'objective': 'poisson',
    'metric': {'poisson'},
    'num_leaves': 2**4,
    'learning_rate': 0.1,
    'verbose': 0,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.01,
    'bagging_freq': 1,
    'n_estimators': 5000,
    'max_delta_step': -1,
}

model = lgb.LGBMRegressor(**params)

model.fit(X.iloc[:train],
          Y.iloc[:train],
          verbose=True,
          eval_set=[(X.iloc[train:], Y.iloc[train:])],
          eval_metric='poisson',
          early_stopping_rounds=20,
          )
pd.Series(model.predict(X)).mean()

The mean of the last line varies significantly between rf and gbrt: for rf it is 0.684, for gbrt it is 0.257. Compared to the actual average of 0.266, gbrt seems to converge a lot better. gbrt is also able to train longer before early stopping (20 vs 2 iterations).

Is this intended behavior for Poisson Regression with Random Forest? The performance difference between gbrt and rf seems too high.

goraj commented 6 years ago

This might be helpful: https://github.com/Microsoft/LightGBM/issues/47#issuecomment-266725875

I personally like the rf mode of LightGBM too. It is fast enough to use for active learning on big data sets.

tclv commented 6 years ago

Hi Goraj, could you elaborate more on what you are referring to? I have tried fiddling with Isotonic regression, but it seems rather patchworky, and the performance is also lacking.

goraj commented 6 years ago

@Tclv I currently only use the rf mode for binary classification. I just remembered reading about this and thought it might be related to your issue. Maybe @Laurae2 can help.

Laurae2 commented 6 years ago

With LightGBM in Random Forest mode, it does not matter what the previous trees were, because each new tree is built independently of the previous ones (it just piles up trees and averages them to predict).

There is no convergence possible with Random Forest, because it is similar to a 1-iteration Gradient Boosting.

Early stopping does not matter in Random Forest (it should never be used); it makes no sense because Random Forest is a random process, unlike Gradient Boosting, which is an optimization process.
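
To illustrate, a minimal sketch (reusing X, Y, and train from the reproduction above; parameter values such as bagging_fraction=0.8 and n_estimators=100 are illustrative, not recommendations from this thread):

import lightgbm as lgb

# In rf mode every tree is built independently on a bagged sample and the
# final prediction is the average over all trees, so there is no sequential
# improvement to monitor and no point in early stopping.
rf_params = {
    'boosting_type': 'rf',
    'objective': 'poisson',
    'num_leaves': 2**4,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.8,   # rf mode needs bagging enabled
    'bagging_freq': 1,
    'n_estimators': 100,
}
rf_model = lgb.LGBMRegressor(**rf_params)
rf_model.fit(X.iloc[:train], Y.iloc[:train])   # no eval_set, no early stopping

# Adding more trees only reduces the variance of the average; it does not
# push the predictions towards the target the way boosting iterations do.
for k in (1, 10, 100):
    print(k, pd.Series(rf_model.predict(X, num_iteration=k)).mean())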

tclv commented 6 years ago

@Laurae2 Thanks for answering!

Early stopping does not matter in Random Forest (it should never be used); it makes no sense because Random Forest is a random process, unlike Gradient Boosting, which is an optimization process.

Right, that makes sense.

There is no convergence possible with Random Forest, because it is similar to a 1-iteration Gradient Boosting.

I am not sure if I am interpreting this right, but the results from 1 iteration of gbrt and rf should be similar? This is what I find odd about the current behavior of rf with LightGBM, and maybe my example is not concise enough. If I set n_estimators to 1 for both rf and gbrt, the results of rf are quite off compared to gbrt (0.6936 vs 0.2611). It seems odd as 0.6936 is quite a steep overprediction, which I have observed across different datasets when using Poisson regression in conjunction with rf. What is causing this discrepancy?

guolinke commented 6 years ago

@Tclv you can try to set boost_from_average=false for both gbdt and rf. It is disabled in rf mode, while it is enabled by default in gbdt mode.
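
For example (a minimal sketch, reusing the params dict from the reproduction above; only the changed keys are shown):

# boost_from_average adjusts the initial score using the label average; per
# the comment above it is disabled in rf mode but enabled by default in gbdt
# mode, so turning it off in gbdt makes the one-tree comparison more alike.
params_gbdt = dict(params, boosting_type='gbdt', boost_from_average=False)
params_rf = dict(params, boosting_type='rf', boost_from_average=False)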

tclv commented 6 years ago

@guolinke The averages for gbrt are 0.963 after 1 iteration, slowly converging down to 0.269 after 80 iterations when boost_from_average=false.

For rf the averages stay at 0.694 regardless of boost_from_average and the number of iterations.

guolinke commented 6 years ago

@Tclv the learning_rate is fixed to 1 in RF mode. So you should set learning_rate to 1 in GBDT for the one iteration comparison.

One suggestion for RF: you can let each tree be more "over-fitting" by using a larger num_leaves and max_depth and a smaller min_data_in_leaf. And bagging_fraction=0.01 seems too small.
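
A sketch of what that suggestion might look like in rf mode (the specific values below are illustrative guesses, not values given in the thread):

# Let each individual tree overfit, and keep a much larger bagging fraction.
overfit_rf_params = {
    'boosting_type': 'rf',
    'objective': 'poisson',
    'num_leaves': 2**10,       # large trees
    'max_depth': -1,           # no depth limit
    'min_data_in_leaf': 5,     # small leaves
    'feature_fraction': 0.5,
    'bagging_fraction': 0.8,   # much larger than 0.01
    'bagging_freq': 1,
    'n_estimators': 200,
}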

tclv commented 6 years ago

@guolinke

params = {
    'task': 'train',
    'boosting_type': 'gbrt',
    'objective': 'poisson',
    'metric': {'poisson'},
    'num_leaves': 2**10,
    'learning_rate': 1,
    'verbose': 0,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,
    'n_estimators': 1,
    'boost_from_average': False,
}

With these parameters (and varying boost_from_average), I get the following results for pd.Series(model.predict(X[:train])).mean():

boost_from_average    True       False
rf                    0.69848    0.69848
gbrt                  0.29658    0.69848

Increasing n_estimators to 40:

boost_from_average    True       False
rf                    0.69815    0.69815
gbrt                  0.26082    0.26067

Laurae2 commented 6 years ago

@Tclv The results look correct and expected now.

I am not sure if I am interpreting this right, but the results from 1 iteration of gbrt and rf should be similar?

Yes.

It seems odd as 0.6936 is quite a steep overprediction

Random Forest usually requires post-calibration of predictions (it is not reasonable to expect the mean of Random Forest predictions to be near the mean of the real distribution, especially with non-standard loss functions such as the Poisson loss).
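
One simple form of such post-calibration (a hedged sketch, assuming an rf-mode estimator rf_model like the one sketched earlier and the X, Y, train variables from the reproduction; the multiplicative rescaling is just one obvious choice, not something prescribed in this thread):

# Rescale the rf predictions so their mean on the training split matches the
# mean of the training labels; isotonic regression (mentioned earlier in the
# thread) is a more flexible alternative.
train_pred = rf_model.predict(X.iloc[:train])
factor = Y.iloc[:train].mean() / train_pred.mean()
calibrated_test_pred = rf_model.predict(X.iloc[train:]) * factor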