stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

NGBRegressor results in too small standard deviation with LGBMRegressor base learner #282

Closed tbezdan closed 2 years ago

tbezdan commented 2 years ago

I tried the NGBoost regressor with an LGBMRegressor base learner; however, the standard deviation in the scale parameter is too small, and as I increase n_estimators, the standard deviation gets even smaller.

from ngboost import NGBRegressor
from ngboost.distns import Normal
from ngboost.scores import MLE
# model is the LGBMRegressor instance used as the base learner
ngb = NGBRegressor(Base=model, Dist=Normal, Score=MLE, natural_gradient=True, verbose=True, n_estimators=500)
ngb.fit(X_train, y_train)
y_pred = ngb.predict(X_test)
y_dists = ngb.pred_dist(X_test)

Below are the first five actual values and the corresponding loc and scale parameters:

y_test[0:5]
array([1914, 1916, 2167, 2512, 2524])
y_dists[0:5].params
{'loc': array([2183.18961312, 1730.21820623, 1974.10724466, 2570.05087747,
        1875.03671811]),
 'scale': array([ 2.95513202,  6.60446968,  9.22563868,  4.399603  , 10.02550103])}

The standard deviation for this particular problem should be around 200-500.

alejandroschuler commented 2 years ago

The scale parameter corresponds to the conditional standard deviation. Even if the marginal standard deviation is large, the conditional one can be much, much smaller (i.e., the marginal variation arises from large variations in the conditional mean). See the law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
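To make the distinction concrete, here is a minimal NumPy sketch (not from the original thread; the constants are chosen only for illustration) in which the conditional standard deviation is 10 but the marginal standard deviation is in the hundreds:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1000, 3000, size=100_000)  # conditional mean E[Y|X] varies widely
y = x + rng.normal(0, 10, size=x.size)     # conditional sd of Y given X is only 10
print(y.std())                             # marginal sd is roughly 580, driven by Var(E[Y|X])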

That said, in reality the data-generating distribution for your problem may actually have larger conditional variances than the ones estimated here. That happens if you have overfit the model: as the conditional mean adapts more and more closely to the observed training data, the conditional variance needs to be smaller and smaller to accommodate the residual error. Best practice is to cross-validate over the model hyperparameters (most importantly n_estimators) using a metric of interest such as the negative log likelihood. This combats overfitting.
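One way to do that, sketched here assuming the X_train and y_train from the snippet above are available (the split fraction and early_stopping_rounds value are placeholders, not recommendations), is to hold out a validation set and let NGBoost stop adding estimators once the validation loss stops improving:

from sklearn.model_selection import train_test_split

# hold out a validation set from the training data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

ngb = NGBRegressor(Dist=Normal, Score=MLE, n_estimators=500)  # default tree base learner
ngb.fit(X_tr, y_tr, X_val=X_val, Y_val=y_val, early_stopping_rounds=20)

# negative log likelihood on the held-out data
print(-ngb.pred_dist(X_val).logpdf(y_val).mean())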

Lastly, when you say you're using LGBM as the base learner, I'm not sure whether you mean a single tree from the LGBM package or an entire boosting model. If it's the latter, then you are almost certainly overfitting very quickly. NGBoost usually works best with simple, fast base learners like single regression trees.
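For comparison, a hedged sketch of that kind of base learner (a single shallow regression tree, similar in spirit to NGBoost's default, rather than a full LightGBM ensemble):

from sklearn.tree import DecisionTreeRegressor

# one shallow regression tree per boosting stage, instead of a whole LGBM model
shallow_tree = DecisionTreeRegressor(criterion="friedman_mse", max_depth=3)
ngb = NGBRegressor(Base=shallow_tree, Dist=Normal, Score=MLE, n_estimators=500)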