I need to use the init_score to provide a prior model but I'm seeing some behaviour I don't understand.
Setup
build a cv model with 20 trees
build a second cv model with 80 trees, using the same folds as the first model and feeding in the first model's cv predictions as init_score
early stopping is enabled to avoid overfitting (the second model stops at 2 trees)
the first model alone has performance (cv metric) 2197.57
the resulting combined performance is 2194.64
repeat the above, but with 10 trees in the first model and 90 in the second
the second model stops at 15 trees
the first model alone has performance 2483.26
the resulting combined performance is 2205.03
Query
I would have expected the two situations to give similar performance (since both models use the same folds, this setup should be equivalent to building a single model with 100 trees in both cases). But I have seen over multiple examples that limiting the trees in the first model leads to worse results. Is there any reason or intuition for why this is the case?
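For reference, the single-model baseline I have in mind is the minimal sketch below; it assumes the `dataset` and `custom_folds` objects defined in the reproducible example that follows:

baseline = lgb.cv(
    params={
        "objective": "gamma",
        "boosting_type": "gbdt",
        "n_estimators": 100,
        "metric": "gamma_deviance",
    },
    train_set=dataset,
    folds=custom_folds,
    stratified=False,
)
# mean validation gamma deviance after the final boosting round
print(baseline["valid gamma_deviance-mean"][-1])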
Reproducible example
import lightgbm as lgb
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
# RUN SCRIPT SECOND TIME BUT CHANGING num_iters TO 10
total_iters = 100
num_iters = 20
# create data
np.random.seed(5)
data = pd.DataFrame({
"a": np.random.random(10_000),
"b": np.random.random(10_000),
"c": np.random.random(10_000),
"d": np.random.random(10_000),
})
data["target"] = np.exp(5 + 3 * data["a"] + data["b"] - 2 * data["c"] + 1.5 * data["d"] + np.random.gamma(0.1, 1, 10_000))
# build first cv model
dataset = lgb.Dataset(
data=data.drop(["target"], axis=1),
label=data["target"],
free_raw_data=False,
)
kf = skms.KFold(n_splits=3, shuffle=True, random_state=309)
# materialize the folds so both cv runs reuse exactly the same train/test splits
custom_folds = list(kf.split(np.zeros(len(data))))
cv_results = lgb.cv(
params={
"objective": "gamma",
"boosting_type": "gbdt",
"n_estimators": num_iters,
"early_stopping": 5,
"metric": "gamma_deviance",
},
train_set=dataset,
folds=custom_folds,
stratified=False,
return_cvbooster=True,
)
# need cv preds to feed into second model - check my cv preds give same metric as lightgbm
print(cv_results["valid gamma_deviance-mean"])
def replicate_metrics(num_iters, model):
list_metrics = []
cv_preds = []
for num_iter in range(1, num_iters + 1):
metric_list = []
for cv_idx, cv_fold in enumerate(custom_folds):
mdl_temp = model.boosters[cv_idx]
# predict from booster
cv_preds_tmp = mdl_temp.predict(
dataset.get_data().loc[cv_fold[1]],
num_iteration=num_iter,
)
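            # gamma deviance: 2 * sum(y/mu - log(y/mu) - 1); the 1e-9 guards against division by zero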
tmp = data["target"].loc[cv_fold[1]] / (cv_preds_tmp + 1.0e-9)
metric_list.append(
2 * sum(tmp - np.log(tmp) - 1)
)
if num_iter == num_iters:
cv_preds.append(cv_preds_tmp)
list_metrics.append(np.mean(metric_list))
cv_preds = (
pd.DataFrame(
{
"idx": np.concatenate([idx[1] for idx in custom_folds]),
"cv_pred": np.concatenate(
cv_preds
),
}
)
.sort_values(by=["idx"])
.reset_index(drop=True)
.pop("cv_pred")
)
print(list_metrics)
return cv_preds
cv_preds = replicate_metrics(len(cv_results["valid gamma_deviance-mean"]), cv_results["cvbooster"])
# second model
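# the gamma objective fits on a log link, so init_score must be supplied on the log scale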
dataset2 = lgb.Dataset(
data=data.drop(["target"], axis=1),
label=data["target"],
free_raw_data=False,
init_score=np.log(cv_preds),
)
cv_results2 = lgb.cv(
params={
"objective": "gamma",
"boosting_type": "gbdt",
"n_estimators": total_iters - num_iters,
"early_stopping": 5,
"metric": "gamma_deviance",
},
train_set=dataset2,
folds=custom_folds,
stratified=False,
return_cvbooster=True,
)
print(cv_results2["valid gamma_deviance-mean"])
I think my initial reasoning was fundamentally wrong, since the init_score I'm feeding in is essentially "test" predictions: each row's score comes from a booster that was not trained on that row. That leads to the question: is it possible to feed in a different init_score for each fold?
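If lgb.cv can't do this directly, the only workaround I can think of is to run the second stage as a manual loop over the folds with lgb.train, so each fold can carry its own init_score. A rough sketch (assuming cv_results["cvbooster"].boosters is ordered the same way as custom_folds, and that each fold's scores should come from the first-stage booster that did not train on that fold):

second_stage = []
X = data.drop(["target"], axis=1)
for cv_idx, (train_idx, test_idx) in enumerate(custom_folds):
    # score every row with the first-stage booster that never saw this fold
    fold_init = cv_results["cvbooster"].boosters[cv_idx].predict(X)
    train_ds = lgb.Dataset(
        X.iloc[train_idx],
        label=data["target"].iloc[train_idx],
        init_score=np.log(fold_init[train_idx]),
    )
    valid_ds = lgb.Dataset(
        X.iloc[test_idx],
        label=data["target"].iloc[test_idx],
        init_score=np.log(fold_init[test_idx]),
        reference=train_ds,
    )
    booster = lgb.train(
        params={
            "objective": "gamma",
            "metric": "gamma_deviance",
            "early_stopping": 5,
        },
        train_set=train_ds,
        num_boost_round=total_iters - num_iters,
        valid_sets=[valid_ds],
    )
    second_stage.append(booster)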
Environment info
Package versions: LightGBM 4.5.0, numpy 1.22.3, pandas 1.4.1, sklearn 1.1.1