microsoft / LightGBM


Different number of trees in CVBooster object between versions 3.3.5 and 4.1.0 #6211

Closed: dtararuj closed this issue 11 months ago

dtararuj commented 11 months ago

Description

Hi, I ran into a strange issue. I'm trying to build a model that predicts both positive and negative values as the output. I am using the default objective and CV with 3 folds.

When I run the code with lightgbm==3.3.5, I get many more trees in my model files than with the newer version.

With the newest version I get only one or two trees in each model .txt file; with the older version, many more.

Is this expected behaviour? Related to this, I also see worse accuracy.

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb
import random
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OrdinalEncoder

def lgb_mae(preds, train_data):
    # Custom eval function: returns (metric_name, value, is_higher_better)
    y_true = train_data.get_label()
    error = mean_absolute_error(y_true, preds)
    return "mae", error, False

# !pip install lightgbm==3.3.5 --user

print(lgb.__version__)
model_name = f'test_{lgb.__version__}'

n_records = 120
account_id = ['xxx'] * n_records
volume = random.sample(range(-1000, 1000), n_records)
date = pd.date_range(pd.to_datetime('2021-01-01'), periods=n_records, freq='1W')
var1 = (np.random.uniform(low=0, high=100, size=(n_records,))).round(2)
var2 = (np.random.uniform(low=0.5, high=20, size=(n_records,))).round(2)

df = pd.DataFrame(list(zip(account_id, volume, date, var1, var2)),
                  columns=['acc_id', 'volume', 'date', 'var1', 'var2'])

df['quarter'] = df.date.dt.quarter
df['week'] = df.date.dt.isocalendar().week.astype(int)  # Series.dt.week was removed in pandas 2.0
df['month'] = df.date.dt.month

train = df.loc[df['date']<'2023-01-01']
test = df.loc[df['date']>='2023-01-01']

features = df.columns[~df.columns.isin(['volume','acc_id','date'])]

lgb_train = lgb.Dataset(
        train[features], train.loc[:, "volume"])
lgb_test = lgb.Dataset(test[features], test.loc[:, "volume"])

num_iterations = 100 
lgb_fun = lgb_mae
constraints = {"var2": -1}  # unused below; the constraint is passed via monotone_constraints in lgb_params
stop_rounds = 50
nfolds = 3

lgb_params = {
        'monotone_constraints': [0,-1,0,0,0],
        'monotone_constraints_method': 'advanced',
        "verbose": -1,
        "metrics": "None",
        "feature_pre_filter": False,
        'learning_rate': 0.170714,
        'num_leaves': 160,
        'max_depth': 18,
        'min_data_in_leaf' : 16,
        'bagging_fraction': 0.979877,
        'feature_fraction': 0.452952,
        'lambda_l1': 0.0105841,
        'lambda_l2': 9.63261e-08,
    }

model_cv = lgb.cv(
        params=lgb_params,
        train_set=lgb_train,
        num_boost_round=num_iterations,
        nfold=nfolds,
        return_cvbooster=True,
        stratified=False,
        feval=lgb_fun,
        callbacks=[
            lgb.early_stopping(stopping_rounds=stop_rounds, verbose=False),
            lgb.log_evaluation(0),
        ],
)
model_1 = model_cv.get("cvbooster")
for i in range(len(model_1.boosters)):
    model_1.boosters[i].save_model(
        f"model_{model_name}_cv{i}.txt"
    )
models = list(model_1.boosters)
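
To see the version difference concretely, you can load each saved file back and count the trees it contains. A minimal sketch, assuming the files written by the loop above exist:

# Reload each saved fold model and report how many trees it contains.
for i in range(nfolds):
    bst = lgb.Booster(model_file=f"model_{model_name}_cv{i}.txt")
    print(f"fold {i}: {bst.num_trees()} trees")
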
jmoralez commented 11 months ago

Hey @dtararuj, thanks for using LightGBM. This is due to #5066. Previously, the individual boosters in the CVBooster object kept all training iterations, regardless of which one was the best iteration. So, for example, if early stopping fired and the best iteration was 5, the boosters previously had 55 rounds (since you set stopping_rounds=50), all of which were saved. Now save_model only writes up to the best iteration (5 in this example). If you want to save all of them, you can do something like:

for i, bst in enumerate(model_1.boosters):
    bst.save_model(
        f'model_{model_name}_cv{i}.txt',
        num_iteration=bst.current_iteration(),
    )
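
To double-check what each fold's booster is holding versus what early stopping picked, you can compare the two counters. A small sketch, reusing the CVBooster from the example above:

# Compare the best iteration chosen by early stopping with the total
# number of iterations each fold's booster still holds in memory.
for i, bst in enumerate(model_1.boosters):
    print(f"fold {i}: best_iteration={bst.best_iteration}, "
          f"kept_iterations={bst.current_iteration()}")
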
dtararuj commented 11 months ago

Ok, that makes sense, thank you.