microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Regression result appears to depend highly on the base score (setting init_score to some constant value) #6658

Closed. sktin closed this issue 1 week ago.

sktin commented 3 weeks ago

Description

This investigation started when I noticed that the result of a regression appeared to depend on the constant value set in init_score (called "base score" in xgboost).

The result should be independent of a constant init_score if the model can fit a constant function perfectly, so that any "bad" choice of init_score would eventually be corrected. Apparently, this "if" does not hold for lightgbm.

Reproducible example

from itertools import product

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
N = 10000
X = rng.random((N, 10))
kfold = KFold(5)

# Three (near-)constant targets plus one with tiny random variation.
targets = {
    'y=1': np.ones(N),
    'y=1.001': np.ones(N) + 1e-3,
    'y=1.000001': np.ones(N) + 1e-6,
    'y=1+rand': np.ones(N) + rng.random(N) * 1e-6,
}
init_scores = [0, 1]

# Cross-validate every (target, init_score) combination with early stopping.
for t, init_score in product(targets.keys(), init_scores):
    print(f'### {t}, {init_score=}')
    lgb.cv(
        params={'verbose': -1},
        train_set=lgb.Dataset(X, targets[t], init_score=[init_score] * len(X)),
        num_boost_round=10000,
        folds=kfold,
        callbacks=[
            lgb.early_stopping(stopping_rounds=100, verbose=True),
        ],
    )
print(f'{lgb.__version__=}')

Output:

### y=1, init_score=0
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1] cv_agg's valid l2: 1 + 0
### y=1, init_score=1
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1] cv_agg's valid l2: 0 + 0
### y=1.001, init_score=0
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[22]    cv_agg's valid l2: 0.00971714 + 0
### y=1.001, init_score=1
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1] cv_agg's valid l2: 1.00009e-06 + 0
### y=1.000001, init_score=0
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[6] cv_agg's valid l2: 0.28243 + 0
### y=1.000001, init_score=1
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1] cv_agg's valid l2: 9.09495e-13 + 0
### y=1+rand, init_score=0
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[171]   cv_agg's valid l2: 8.99165e-14 + 2.31042e-15
### y=1+rand, init_score=1
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[36]    cv_agg's valid l2: 8.55969e-14 + 1.85783e-15
lgb.__version__='4.5.0'

Environment info

LightGBM version or commit hash: 4.5.0

Command(s) you used to install LightGBM

pip install lightgbm==4.5.0

Additional Comments

jameslamb commented 1 week ago

Thanks for using LightGBM.

I understand the claims you're making, but not how the code you've provided is related to those claims.

Here's a simpler reproducible example in Python:

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 10))
y = np.ones(X.shape[0])

bst = lgb.train(
    params={"objective": "regression"},
    train_set=lgb.Dataset(X, label=y, init_score=np.full_like(y, fill_value=10.0)),
    num_boost_round=10
)

bst.predict(X)
# array([0., 0., 0., ..., 0., 0., 0.])

This is expected behavior. When you do not provide an init_score, LightGBM will take some representative "average" of the target and start boosting from there. For the built-in regression objective, that is literally the arithmetic mean of the target.
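For instance (a minimal sketch, not from the original thread, mirroring the example above but without an init_score):

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 10))
y = np.ones(X.shape[0])

# No init_score: LightGBM boosts from np.mean(y) = 1.0, and since no split
# has positive gain on a constant target, that mean is the final prediction.
bst = lgb.train(
    params={"objective": "regression", "verbose": -1},
    train_set=lgb.Dataset(X, label=y),
    num_boost_round=10,
)

bst.predict(X)
# expected: array([1., 1., 1., ..., 1., 1., 1.])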

You might find this recent discussion on that interesting: https://github.com/microsoft/LightGBM/pull/6569#discussion_r1791122758

When you DO provide an init_score, LightGBM skips that step and instead starts boosting from that init_score. However, it intentionally will never predict that init_score... it will just use it to evaluate the gain of potential splits, and then use the leaf values from those splits to set the predicted values.
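Concretely, predict() returns only the boosted part, so a caller who passes init_score has to add it back. A minimal sketch (not from the thread; it uses a hypothetical target that actually depends on the features, so boosting has something to learn):

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 10))
y = 3.0 * X[:, 0] + 1.0        # target with real signal
init = np.full_like(y, 10.0)   # deliberately "bad" constant init_score

bst = lgb.train(
    params={"objective": "regression", "verbose": -1},
    train_set=lgb.Dataset(X, label=y, init_score=init),
    num_boost_round=200,
)

raw = bst.predict(X)                    # boosted part only, roughly y - 10
print(np.mean(np.abs(raw + 10.0 - y)))  # small: boosting corrected the offset
print(np.mean(np.abs(raw - y)))         # ~10: the init_score is not included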

In the example you've described, it's impossible for LightGBM to make any splits... if every value of the target is identical, then the gain of every potential split is 0.0. As a result, in this situation you'll get a lot of these warnings in logs:

[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

And then predict() will always just return 0.0.
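To make the "gain of every potential split is 0.0" point concrete, here is a back-of-the-envelope sketch (not LightGBM's code, just the usual sum-of-gradients form of the L2 split gain):

import numpy as np

# With identical targets y_i = 1 and a constant init_score s, every gradient
# is g = s - 1. For L2 loss, the gain of splitting n rows into (n_l, n_r) is
#   G_l^2 / n_l + G_r^2 / n_r - G^2 / n,
# which is identically zero when all gradients are equal.
def l2_split_gain(g, n_l):
    n = len(g)
    G_l, G_r, G = g[:n_l].sum(), g[n_l:].sum(), g.sum()
    return G_l**2 / n_l + G_r**2 / (n - n_l) - G**2 / n

g = np.full(10_000, 0.0 - 1.0)  # gradients for init_score=0, y=1
print(l2_split_gain(g, 2_500))  # 0.0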

So it's not that the results "depend highly on the base score" ... it's that LightGBM's behavior differs based on whether or not you provide an init_score at all.

All the other variation in the logs you've provided is unrelated noise: the code snippet doesn't control for randomness in several places, and it compares 3 examples with a constant target to 1 example whose target varies randomly, independently of the features.
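(For what it's worth, LightGBM's own sources of randomness can be pinned down with its standard parameters; the values below are just illustrative:)

params = {
    "verbose": -1,
    "seed": 42,             # master seed for bagging / feature sub-sampling
    "deterministic": True,  # ensure stable results across runs (CPU only, may be slower)
    "num_threads": 1,       # avoid thread-order floating-point differences
}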

> The result should be independent of a constant init_score if the model can fit a constant function perfectly, so that any "bad" choice of init_score would eventually be corrected.

This is just not correct. If all values of the target are identical, then the model cannot possibly learn any relationship between the features and the target.

sktin commented 1 week ago

Thank you for explaining the behavior of lightgbm with respect to a custom init_score when fitting a constant function. I was under the impression that, with enough boosting steps, GBDT models could recover gracefully from a bad initial choice.

I double-checked the behavior of xgboost. It seems that it can recover from a bad choice of base_score (the initial global bias) but not from a bad choice of base_margin (the equivalent of init_score in lightgbm). Conceptually both boost from a certain constant level, but apparently they do different things under the hood.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((10_000, 10))
y = np.ones(X.shape[0])

params = {"objective": "reg:squarederror", "reg_lambda": 0}

# Case 1: a "bad" per-row base_margin (the analogue of init_score).
bst = xgb.train(
    params,
    xgb.DMatrix(X, y, base_margin=np.full_like(y, fill_value=10.0)),
    num_boost_round=100,
)
print("# base_margin=10")
print(bst.predict(xgb.DMatrix(X)))

# Case 2: a "bad" global base_score (the initial global bias).
bst = xgb.train(
    params | {"base_score": 10.0},
    xgb.DMatrix(X, y),
    num_boost_round=100,
)
print("# base_score=10")
print(bst.predict(xgb.DMatrix(X)))

Output:

# base_margin=10
[-8. -8. -8. ... -8. -8. -8.]
# base_score=10
[1.0000001 1.0000001 1.0000001 ... 1.0000001 1.0000001 1.0000001]

An output of -8 is no better than 0 (both are wrong), so I see no reason to pick on lightgbm in particular. Unless you have further comments, I would consider the case closed.
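(A side note on the -8, sketched rather than taken from the thread: the trees have learned a large negative correction toward y = 1, and predict() without a base_margin simply omits the +10. As with lightgbm's init_score, the base_margin is not stored in the model, so supplying it again at prediction time should recover sensible outputs:)

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((10_000, 10))
y = np.ones(X.shape[0])
margin = np.full(X.shape[0], 10.0)

bst = xgb.train(
    {"objective": "reg:squarederror", "reg_lambda": 0},
    xgb.DMatrix(X, y, base_margin=margin),
    num_boost_round=100,
)

# base_margin is not part of the saved model; pass it again at predict time.
print(bst.predict(xgb.DMatrix(X, base_margin=margin)))
# expected: values close to 1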

jameslamb commented 1 week ago

> I was under the impression that, with enough boosting steps, GBDT models could recover gracefully from a bad initial choice.

They can if there is any meaningful relationship between the features and the target.

Thanks for the XGBoost example, yes that's a good illustration of how this is not specific to LightGBM. We can keep this closed.