microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

lightgbm.cv shows it trained way more estimators than are shown by the current_iteration of its component boosters #5810

Open nitinmnsn opened 1 year ago

nitinmnsn commented 1 year ago

Description

Running lightgbm.cv with a lightgbm.callback.early_stopping(50, False) callback results in a cvbooster whose best_iteration is 2009, whereas the current_iteration() values of the individual boosters in the cvbooster are [1087, 1231, 1191, 1047, 1225]. My understanding is that the cvbooster's best_iteration should be exactly max(current_iterations) - early_stopping_rounds (see the check sketched after the output below).

Reproducible example

Import dependencies:

import pandas as pd
import numpy as np
import lightgbm
from sklearn.datasets import make_classification

Create dummy data

data = make_classification(n_samples = 10_000, n_features = 100, n_informative = 100, n_redundant = 0,
                           flip_y = 0.05, n_clusters_per_class = 5, class_sep = 0.5, random_state = 0)
dt = pd.DataFrame(data[0], columns = [f"feat_{i}" for i in range(data[0].shape[1])])
yt = pd.Series(data[1])

Create the training dataset

train_data = lightgbm.Dataset(dt, label = yt, params = {'max_bin': 335, 'min_data_in_bin': 620, 'verbose': -1})

Set training hyperparameters that reproduce the behavior

model_params = {'learning_rate': 0.015355286838886862,
 'num_leaves': 96,
 'subsample': 0.9704285838459497,
 'scale_pos_weight': 7.34674002393291,
 'lambda_l1': 5.986584841970366,
 'lambda_l2': 1.5601864044243652,
 'linear_lambda': 1.5599452033620265,
 'min_sum_hessian_in_leaf': 0,
 'max_depth': 25,
 'feature_fraction': 0.15227525095137953,
 'feature_fraction_bynode': 0.8795585311974417,
 'min_gain_to_split': 0.6410035105688879,
 'max_cat_threshold': 156,
 'cat_l2': 71.09918520180851,
 'cat_smooth': 3.0378649352844422,
 'max_cat_to_onehot': 17,
 'cegb_penalty_split': 0.0009699098521619943,
 'path_smooth': 0.8324426408004217,
 'sigmoid': 2.202157195714934,
 'pos_bagging_fraction': 0.4297256589643226,
 'neg_bagging_fraction': 0.5104629857953323,
 'bagging_freq': 16,
 'metric': ['auc'],
 'objective': 'binary',
 'boost_from_average': True,
 'feval': [],
 'boosting_type': 'gbdt',
 'n_estimators': 10000}

Create an early stopping callback

esc = lightgbm.callback.early_stopping(50, False)

Run lightgbm.cv

lgc = lightgbm.cv(params = model_params, train_set = train_data, nfold = 5, callbacks = [esc], return_cvbooster=True, seed = 17)

Check the output

print(lgc['cvbooster'].best_iteration, [i.current_iteration() for i in lgc['cvbooster'].boosters])

output:

2009 [1087, 1231, 1191, 1047, 1225]
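
For reference, this is the relationship I expected to hold, written as a small check (a sketch reusing lgc from the cv call above; early_stopping_rounds is just the 50 passed to the callback). The assertion fails with the numbers shown:

# Hypothetical check of the expected relationship, not an official LightGBM guarantee
early_stopping_rounds = 50
iterations = [b.current_iteration() for b in lgc['cvbooster'].boosters]
# Expected: best_iteration == max(iterations) - early_stopping_rounds
assert lgc['cvbooster'].best_iteration == max(iterations) - early_stopping_rounds  # fails here: 2009 != 1231 - 50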

Also, if I run lightgbm.cv for exactly 2009 rounds (the best iteration from the cv run with early stopping) without early stopping, the component boosters sometimes end up with different current_iteration() values. In this particular case, if we run

model_params["n_estimators"] = 2009
lgc1 = lightgbm.cv(params = model_params, train_set = train_data, nfold = 5, 
                return_cvbooster=True, seed = 17, verbose_eval = True)

Then check the individual boosters' current_iteration() values

print(lgc1['cvbooster'].best_iteration, [i.current_iteration() for i in lgc1['cvbooster'].boosters])

output:

-1 [1087, 1221, 1191, 1047, 1225]  # as opposed to [1087, 1231, 1191, 1047, 1225] from the run with early stopping
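
To pinpoint where the two runs diverge, the per-fold counts can be compared directly (a small sketch reusing lgc and lgc1 from above; its_es and its_fixed are just illustrative names). With these runs, only the second fold differs:

# Compare per-fold current_iteration() between the early-stopping run and the fixed-2009 run
its_es = [b.current_iteration() for b in lgc['cvbooster'].boosters]      # [1087, 1231, 1191, 1047, 1225]
its_fixed = [b.current_iteration() for b in lgc1['cvbooster'].boosters]  # [1087, 1221, 1191, 1047, 1225]
print([(fold, a, b) for fold, (a, b) in enumerate(zip(its_es, its_fixed)) if a != b])  # [(1, 1231, 1221)]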

Environment info

lightgbm - 3.3.5, installed with pip install lightgbm
pandas - 1.5.2
numpy - 1.23.5
scikit-learn - 1.2.1

system - ubuntu 22.10

Additional Comments

jmoralez commented 1 year ago

Hi @nitinmnsn, thanks for the reproducible example! I was able to take a look at this tonight, and I see that it's because the individual boosters are reaching this point: https://github.com/microsoft/LightGBM/blob/216eaff723e11a84b27ae4275675a46e8c7326ba/src/boosting/gbdt.cpp#L424-L432

The warnings aren't printed in your example because of the 'verbose': -1 in the dataset params; if you remove it, you can see them printed non-stop.
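
For example, a minimal sketch of how to surface them, reusing dt, yt, and model_params from the example above (train_data_verbose is just an illustrative name; only the Dataset params change):

# Rebuild the Dataset without the 'verbose': -1 override so LightGBM's warnings are shown
train_data_verbose = lightgbm.Dataset(dt, label = yt, params = {'max_bin': 335, 'min_data_in_bin': 620})

# Re-run the same cv call; the warnings mentioned above are now printed
lightgbm.cv(params = model_params, train_set = train_data_verbose, nfold = 5,
            callbacks = [lightgbm.callback.early_stopping(50, False)],
            return_cvbooster = True, seed = 17)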

Just wanted to share my findings so far in case someone wants to pick it up.

Nitinsiwach commented 1 year ago

Why does the cvbooster continue in that case? What does it even mean for the cvbooster to continue in that case? I knew about the code you linked, but I thought that since there are no more splits and the individual boosters aren't training, that is exactly where early stopping should kick in. If the individual boosters aren't training, then how did the cvbooster reach 2009 iterations?

And then there's also the second anomaly (I think) that I highlighted: training the cvbooster with early stopping, which takes it to the 2009th iteration, vs. training with 2009 estimators explicitly specified. The individual booster rounds differ between the two, as I've shown.