microsoft / LightGBM


Loss in performance by using init_score #6723

Open · dwayne298 opened this issue 4 days ago

dwayne298 commented 4 days ago

Description

I need to use init_score to provide a prior model, but I'm seeing some behaviour I don't understand.

Setup

Run the script below twice, keeping the total at 100 iterations but changing how they are split between the two models: first with num_iters = 20 for the first model (80 for the second), then with num_iters = 10 (90 for the second). The first model's out-of-fold CV predictions are passed as init_score to the second model, and both models use the same folds.

Query

I would have expected the two situations to give similar performance: since the same folds are used for both models, each setup should be equivalent to training a single model with 100 trees. But I have seen over multiple examples that limiting the number of trees in the first model leads to worse results. Is there any reason or intuition for why this is the case?
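For reference, here is a minimal sketch of the equivalence I had in mind (my own toy example, on a single training set rather than CV): continuing boosting from an init_score equal to the first model's raw in-sample scores should reproduce a single longer model, up to floating-point noise.

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((1000, 4)), columns=list("abcd"))
y = np.exp(1 + X["a"] + rng.gamma(0.1, 1, 1000))

params = {"objective": "gamma", "verbosity": -1}

# one model, 100 iterations
full = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)

# 20 iterations, then 80 more starting from the first model's raw scores
first = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=20)
raw = first.predict(X, raw_score=True)  # log-scale scores under the gamma objective
second = lgb.train(params, lgb.Dataset(X, label=y, init_score=raw),
                   num_boost_round=80)

# the booster does not store init_score, so add it back when predicting
staged = raw + second.predict(X, raw_score=True)
print(np.allclose(staged, full.predict(X, raw_score=True)))  # expect True (up to float noise)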

Reproducible example

import lightgbm as lgb
import pandas as pd
import numpy as np
import sklearn.model_selection as skms

# RUN SCRIPT SECOND TIME BUT CHANGING num_iters TO 10
total_iters = 100
num_iters = 20

# create data
np.random.seed(5)
data = pd.DataFrame({
    "a": np.random.random(10_000),
    "b": np.random.random(10_000),
    "c": np.random.random(10_000),
    "d": np.random.random(10_000),
})
data["target"] = np.exp(5 + 3 * data["a"] + data["b"] - 2 * data["c"] + 1.5 * data["d"] + np.random.gamma(0.1, 1, 10_000))

# build first cv model
dataset = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
)
kf = skms.KFold(n_splits=3, shuffle=True, random_state=309)
custom_folds = list(kf.split(np.zeros(len(data))))

cv_results = lgb.cv(
    params={
        "objective": "gamma", 
        "boosting_type": "gbdt", 
        "n_estimators": num_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset,
    folds=custom_folds,
    stratified=False,
    return_cvbooster=True,
)

# need cv preds to feed into second model - check my cv preds give same metric as lightgbm
print(cv_results["valid gamma_deviance-mean"])
def replicate_metrics(num_iters, model):
    """Recompute the per-iteration mean gamma deviance and collect the
    final out-of-fold predictions."""
    list_metrics = []
    cv_preds = []
    for num_iter in range(1, num_iters + 1):
        metric_list = []
        for cv_idx, cv_fold in enumerate(custom_folds):
            mdl_temp = model.boosters[cv_idx]

            # predict this fold's validation rows from its booster
            cv_preds_tmp = mdl_temp.predict(
                dataset.get_data().loc[cv_fold[1]],
                num_iteration=num_iter,
            )

            # gamma deviance: 2 * sum(y/mu - log(y/mu) - 1)
            tmp = data["target"].loc[cv_fold[1]] / (cv_preds_tmp + 1.0e-9)
            metric_list.append(2 * sum(tmp - np.log(tmp) - 1))

            # keep the out-of-fold predictions from the final iteration
            if num_iter == num_iters:
                cv_preds.append(cv_preds_tmp)
        list_metrics.append(np.mean(metric_list))

    # reassemble the out-of-fold predictions in original row order
    cv_preds = (
        pd.DataFrame({
            "idx": np.concatenate([idx[1] for idx in custom_folds]),
            "cv_pred": np.concatenate(cv_preds),
        })
        .sort_values(by=["idx"])
        .reset_index(drop=True)
        .pop("cv_pred")
    )

    print(list_metrics)
    return cv_preds

cv_preds = replicate_metrics(len(cv_results["valid gamma_deviance-mean"]), cv_results["cvbooster"])

# second model: out-of-fold predictions become the prior; the gamma
# objective uses a log link, so init_score must be on the raw (log) scale
dataset2 = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
    init_score=np.log(cv_preds),
)

cv_results2 = lgb.cv(
    params={
        "objective": "gamma", 
        "boosting_type": "gbdt", 
        "n_estimators": total_iters - num_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset2,
    folds=custom_folds,
    stratified=False,
    return_cvbooster=True,
)  
print(cv_results2["valid gamma_deviance-mean"])
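For clarity, this is how I compare the two runs: the overall validation trajectory of a staged run is just the first-stage curve followed by the second-stage curve (the metric in the second stage already accounts for the init_score), so:

# overall validation trajectory of a staged run: first-stage curve followed
# by the second-stage curve; both are plain lists of per-iteration means
staged_curve = (cv_results["valid gamma_deviance-mean"]
                + cv_results2["valid gamma_deviance-mean"])
print(len(staged_curve), min(staged_curve))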

Environment info

Package versions:

LightGBM: 4.5.0
numpy: 1.22.3
pandas: 1.4.1
sklearn: 1.1.1

Command(s) you used to install LightGBM

python -m venv venv
venv\Scripts\activate.ps1
python -m pip install -r requirements.txt

dwayne298 commented 1 day ago

I think my initial reasoning was fundamentally wrong: the init_score I'm feeding in consists of out-of-fold ("test") predictions, not the predictions each fold's booster would produce on its own training rows. That leads to the question: is it possible to feed in a different init_score for each fold?
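As far as I can tell lgb.cv does not accept a per-fold init_score, but the folds can be run manually with lgb.train, giving each fold its own Datasets carrying their own init_score arrays. A sketch under that assumption, reusing custom_folds, cv_preds, and the first-stage cvbooster from the script above; here the training rows get the fold model's in-sample raw scores rather than out-of-fold ones, which is exactly the distinction I was missing:

# manual CV loop so each fold gets its own init_score
X = data.drop(["target"], axis=1)
n_first = len(cv_results["valid gamma_deviance-mean"])  # iterations actually kept in stage one
boosters = []
for cv_idx, (train_idx, test_idx) in enumerate(custom_folds):
    fold_model = cv_results["cvbooster"].boosters[cv_idx]

    train_set = lgb.Dataset(
        X.loc[train_idx],
        label=data["target"].loc[train_idx],
        # in-sample raw (log-scale) scores from this fold's first-stage model
        init_score=fold_model.predict(X.loc[train_idx], raw_score=True,
                                      num_iteration=n_first),
    )
    valid_set = lgb.Dataset(
        X.loc[test_idx],
        label=data["target"].loc[test_idx],
        # out-of-fold predictions, moved to the raw (log) scale
        init_score=np.log(cv_preds.loc[test_idx]),
        reference=train_set,
    )

    boosters.append(lgb.train(
        {"objective": "gamma", "metric": "gamma_deviance"},
        train_set,
        num_boost_round=total_iters - num_iters,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(5)],
    ))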