microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/

[python-package] Changing leaf values with `set_leaf_output` in callback doesn't work the same as using lgb.train for each epoch. #6009

Open dacian-dataheroes opened 1 year ago

dacian-dataheroes commented 1 year ago

Description

Changing leaf values with `set_leaf_output` inside a callback does not produce the same model as calling `lgb.train` one iteration at a time and changing the leaf values between iterations.

Reproducible example

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, n_features=10)
dataset = lgb.Dataset(data=X, label=y, free_raw_data=False)

# NOTE: `params` was not defined in the original report; a minimal regression
# configuration is assumed here so the example runs end to end.
params = {"objective": "regression", "verbosity": -1}

# After each boosting round, add +1 to every leaf of the tree that was just grown.
def my_callback(env):
    booster = env.model
    num_leaves = booster.dump_model(
        start_iteration=booster.num_trees() - 1, num_iteration=1
    )["tree_info"][0]["num_leaves"]
    for leaf_id in range(num_leaves):
        leaf_output = booster.get_leaf_output(
            tree_id=booster.num_trees() - 1, leaf_id=leaf_id
        )
        booster.set_leaf_output(
            tree_id=booster.num_trees() - 1, leaf_id=leaf_id, value=leaf_output + 1
        )

NUM_BOOST_ROUND = 3
booster1 = lgb.train(
    params=params, train_set=dataset, num_boost_round=NUM_BOOST_ROUND, callbacks=[my_callback]
)

# Equivalent manual loop: train one iteration at a time, then add +1 to every
# leaf of the newest tree before continuing from the updated model.
for epoch in range(NUM_BOOST_ROUND):
    if epoch == 0:
        booster2 = lgb.train(
            params=params,
            num_boost_round=1,
            train_set=dataset,
        )
    else:
        booster2 = lgb.train(
            params=params,
            num_boost_round=1,
            train_set=dataset,
            init_model=booster2,
        )

    num_trees = booster2.num_trees()
    num_leaves = booster2.dump_model(start_iteration=num_trees - 1, num_iteration=1)[
        "tree_info"
    ][0]["num_leaves"]
    for leaf_id in range(num_leaves):
        leaf_output = booster2.get_leaf_output(tree_id=num_trees - 1, leaf_id=leaf_id)
        booster2.set_leaf_output(
            tree_id=num_trees - 1, leaf_id=leaf_id, value=leaf_output + 1
        )

# These should be the same, right? They are close, but not the same.
print(booster1.get_leaf_output(NUM_BOOST_ROUND-1, 1), booster2.get_leaf_output(NUM_BOOST_ROUND-1, 1))

Environment info

LightGBM version or commit hash: 4.0.0

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

jrvmalik commented 1 year ago

Setting the leaf output does not change the inner prediction buffer. `booster._Booster__inner_predict_buffer[0]` is essentially the train predictions carried between iterations. To make the subsequent boosting rounds consistent, you would also want to update that buffer by adding the offset from the new leaf scores. Not 100% sure, but definitely worth a try.
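
A minimal sketch of what that suggestion could look like inside the callback, assuming the same +1-per-leaf offset as in the reproducible example (names like `my_callback_with_buffer_update` and `booster3` are just for illustration). `_Booster__inner_predict_buffer` is a private attribute of the Python Booster (name-mangled from `__inner_predict_buffer`), it is only populated once an inner prediction/evaluation has run, and updating it may not propagate to the C++ training scores, so treat this as an experiment rather than a confirmed fix:

def my_callback_with_buffer_update(env):
    booster = env.model
    last_tree = booster.num_trees() - 1
    num_leaves = booster.dump_model(
        start_iteration=last_tree, num_iteration=1
    )["tree_info"][0]["num_leaves"]

    # Shift every leaf of the newest tree by +1, as in the original example.
    for leaf_id in range(num_leaves):
        leaf_output = booster.get_leaf_output(tree_id=last_tree, leaf_id=leaf_id)
        booster.set_leaf_output(tree_id=last_tree, leaf_id=leaf_id, value=leaf_output + 1)

    # Every training row falls in exactly one leaf of the newest tree, so a
    # constant +1 per leaf shifts every training prediction by exactly +1.
    # Mirror that shift in the cached train-prediction buffer (index 0 = train
    # data), if it has already been materialized. This is a private attribute
    # and may change between LightGBM versions.
    buffer = getattr(booster, "_Booster__inner_predict_buffer", None)
    if buffer is not None and buffer[0] is not None:
        buffer[0] += 1.0

# Usage, reusing `params`, `dataset`, and NUM_BOOST_ROUND from the example above:
booster3 = lgb.train(
    params=params,
    train_set=dataset,
    num_boost_round=NUM_BOOST_ROUND,
    callbacks=[my_callback_with_buffer_update],
)

For offsets that differ per leaf, the per-row correction is no longer a constant; one would first map each row to its leaf with `booster.predict(X, pred_leaf=True, start_iteration=last_tree, num_iteration=1)` and add the corresponding delta to each buffer entry.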