stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Difference between num_estimator in base learner and NGBRegressor #221

Closed mburaksayici closed 3 years ago

mburaksayici commented 3 years ago

Hi, I think I've mostly worked out the difference between the base learner's estimator count and NGBRegressor's, but it's unclear in the documentation. We can use base learners from scikit-learn/LightGBM, as is done in https://github.com/ryan-wolbeck/ngboost-tuner/blob/master/ngboost_tuner/tune.py , but I first thought the number of estimators in the base learner was equal to NGBRegressor's n_estimators. Then I realized they are parameterized separately, meaning we can specify a different n_estimators for each.

1) Is the base learner's n_estimators a parameter worth tuning? From the pseudocode of the algorithm, the parameter update by natural gradient descent seems to matter more, but I'd like to ask.
2) The default number of estimators is generally 100 for random forest algorithms, and the tutorials use n_estimators=100 for NGBoost (https://stanfordmlgroup.github.io/ngboost/1-useage.html), so one might mistakenly use sklearn's RandomForestRegressor with its default of 100 estimators, causing 100*100 = 10,000 (inefficient?) iterations.
3) Is there any difference between using DecisionTreeRegressor, RandomForestRegressor, or LGBMRegressor, given that the base learner's n_estimators should be 1 and NGBRegressor has its own n_estimators and treats the base learner as an independent function? LGBMRegressor builds on its previous trees when n_estimators is larger than 1, but NGBRegressor won't make use of that at all if we set n_estimators=1 in LGBMRegressor. So the difference between them comes down to how each algorithm builds its first tree.

I think I understand what's going on, but some clarification may help others.

alejandroschuler commented 3 years ago

As an example, let's say you decide to use a 100-tree random forest as the base learner and you set n_estimators to 500 in ngboost. Then you are boosting a sequence of 500 100-tree random forests, i.e. you will fit 500 random forest models, or 50,000 total trees! (actually 50,000 times the number of parameters in your chosen distribution)
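The arithmetic above can be spelled out in a few lines. The distribution-parameter count of 2 is an assumption (e.g. a Normal distribution with loc and scale, NGBoost fitting one base learner per parameter per stage):

```python
# Hypothetical tree-count arithmetic for a 100-tree random forest base
# learner with 500 NGBoost stages and a 2-parameter distribution.
n_boosting_stages = 500   # NGBRegressor's n_estimators
trees_per_base = 100      # the random forest's own n_estimators
n_dist_params = 2         # e.g. Normal: loc and scale (assumed)

# One base learner is fit per distribution parameter per stage
total_base_fits = n_boosting_stages * n_dist_params
total_trees = total_base_fits * trees_per_base
print(total_trees)  # 100000
```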

The whole idea of boosting is that you turn a weak learner into a strong one by constructing a sequence that corrects the previous errors. So there isn't a good a-priori reason to use a strong model like a random forest as the base learner. The default (regression tree) is a good choice for almost all applications and I wouldn't change that unless you have very good reason to.
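For reference, the default base learner mentioned above is a shallow regression tree. This sketch mirrors the configuration used by ngboost's default_tree_learner at the time of writing; check the library source if its defaults have changed:

```python
from sklearn.tree import DecisionTreeRegressor

# Shallow regression tree, matching ngboost's default base learner
# (a weak learner, in keeping with the boosting rationale above)
default_tree_learner = DecisionTreeRegressor(
    criterion="friedman_mse",
    max_depth=3,
)
```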

mburaksayici commented 3 years ago

That's what I expected, but thanks for the advice to stick with regression trees. I think this should be documented somewhere, though; I used it mistakenly and then found it worth asking about. Thanks!