microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Custom median objective function in lightgbm.cv() #6620

Closed. arumds closed this issue 1 month ago.

arumds commented 2 months ago

LightGBM version 4.0.0

The objective='regression' setting trains the model to predict the mean of the data, but I am interested in training it to predict the median of the actual values. In fact, a quantile model with alpha=0.5 would solve the problem. However, the quantile objective does not work with the monotone_constraints parameter, which is essential in our case. Therefore, a custom median_loss is passed as the objective in the params.

import lightgbm as lgb
import numpy as np

def median_loss(preds, train_data: lgb.Dataset):
    y_true = train_data.get_label()
    residual = preds - y_true
    # gradient of the pinball loss at alpha=0.5: +0.5 when over-predicting, -0.5 otherwise
    grad = np.where(residual > 0, 0.5, -0.5)
    hess = np.ones_like(grad)  # Hessian is constant for median pinball loss
    return grad, hess

params = {
    "objective": median_loss,
}

cv_result = lgb.cv(params, dtrain, nfold=n_folds,  stratified=False, return_cvbooster=True)
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

Debugging shows that all predictions during the lgb.cv() step are 0s, so the gradients are uniform. This might not give LightGBM enough gradient information to make meaningful splits.

Does anyone have a suggestion on how to train the model effectively with the median_loss custom objective, or with a quantile objective while preserving the monotonic constraints? @jameslamb @vladv14

jmoralez commented 2 months ago

Hey. Thanks for using LightGBM. Can you try setting the condition to greater than or equal? i.e.

grad = np.where(residual >= 0, 0.5, -0.5)

arumds commented 2 months ago

@jmoralez I tried setting grad = np.where(residual >= 0, 0.5, -0.5):

params = {
    "objective": median_loss,
}

cv_result = lgb.cv(params, dtrain, nfold=n_folds,  metrics='rmse', stratified=False, return_cvbooster=True)

Log:

[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[1] cv_agg's train rmse: 4.66734 + 0.00107263   cv_agg's valid rmse: 4.66734 + 0.00428721
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

When debugging the median_loss objective during the execution of lgb.cv(), the predictions are all zero, as seen in the screenshot:

[Screenshot 2024-08-22 at 23:09:04: debugger view showing preds all equal to zero]

With objective='regression' the model trains normally. Logs are below:

[1] cv_agg's train rmse: 0.730986 + 0.000761274 cv_agg's valid rmse: 0.730999 + 0.00305853
[2] cv_agg's train rmse: 0.724106 + 0.000747364 cv_agg's valid rmse: 0.724126 + 0.00305247
[3] cv_agg's train rmse: 0.717755 + 0.000743182 cv_agg's valid rmse: 0.717786 + 0.00304095
[4] cv_agg's train rmse: 0.711056 + 0.000728518 cv_agg's valid rmse: 0.711092 + 0.00303802
[5] cv_agg's train rmse: 0.704382 + 0.000716823 cv_agg's valid rmse: 0.704426 + 0.00302899
[6] cv_agg's train rmse: 0.69778 + 0.00070809   cv_agg's valid rmse: 0.697832 + 0.00301913
[7] cv_agg's train rmse: 0.691297 + 0.000700247 cv_agg's valid rmse: 0.691353 + 0.00301123
[8] cv_agg's train rmse: 0.685269 + 0.000683244 cv_agg's valid rmse: 0.685337 + 0.00301251
[9] cv_agg's train rmse: 0.678915 + 0.000665435 cv_agg's valid rmse: 0.678987 + 0.00301451
[10]    cv_agg's train rmse: 0.672621 + 0.000661577 cv_agg's valid rmse: 0.672699 + 0.00300223
[11]    cv_agg's train rmse: 0.666394 + 0.000655792 cv_agg's valid rmse: 0.666477 + 0.00299132
jmoralez commented 2 months ago

When using a custom objective, LightGBM sets the init score to 0, and if it doesn't find a positive gain for any split you may be left with a single tree containing only the root. You can verify this with the trees_to_dataframe method.

If you're able to provide a reproducible example we can assist further. The following seems to train normally:

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression

def median_loss(preds, train_data: lgb.Dataset):
    y_true = train_data.get_label()
    residual = preds - y_true
    grad = np.where(residual >= 0, 0.5, -0.5)
    hess = np.ones_like(grad)  # Hessian is constant for median pinball loss
    return grad, hess

X, y = make_regression(n_samples=1000, n_features=2)
dtrain = lgb.Dataset(X, y)
params={"objective": median_loss, 'num_leaves': 32, 'verbosity': -1, 'metrics': 'l2'}
cv_hist = lgb.cv(
    params,
    dtrain,
    num_boost_round=10,
    nfold=2,
    stratified=False,
    callbacks=[lgb.log_evaluation(1)],
)
# [1]   cv_agg's valid l2: 15698.8 + 269.489
# [2]   cv_agg's valid l2: 15689.7 + 269.239
# [3]   cv_agg's valid l2: 15680.5 + 268.99
# [4]   cv_agg's valid l2: 15671.4 + 268.741
# [5]   cv_agg's valid l2: 15662.2 + 268.491
# [6]   cv_agg's valid l2: 15653.1 + 268.242
# [7]   cv_agg's valid l2: 15644 + 267.993
# [8]   cv_agg's valid l2: 15634.8 + 267.744
# [9]   cv_agg's valid l2: 15625.7 + 267.495
# [10]  cv_agg's valid l2: 15616.6 + 267.246
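
For completeness, here is a minimal sketch (not part of the original reply) of the trees_to_dataframe check, reusing params and dtrain from the example above; passing return_cvbooster=True exposes the per-fold boosters:

cv_res = lgb.cv(
    params,
    dtrain,
    num_boost_round=10,
    nfold=2,
    stratified=False,
    return_cvbooster=True,
)
for booster in cv_res["cvbooster"].boosters:
    tree_df = booster.trees_to_dataframe()
    # a booster that never found a split shows a single root node ("0-L0") per tree
    print(tree_df[["tree_index", "node_depth", "node_index", "value", "count"]].head())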
arumds commented 2 months ago

@jmoralez Attached is a test dtrain binary file which can be used to reproduce the issue as below:

dataset_from_file = lgb.Dataset(data="test.bin")

params={"objective": median_loss, 'num_leaves': 32, 'verbosity': -1, 'metrics': 'l2'}
cv_hist = lgb.cv(
    params,
    dataset_from_file,
    num_boost_round=10,
    nfold=2,
    stratified=False,
    callbacks=[lgb.log_evaluation(1)],
    seed=0,
    metrics='rmse',
    eval_train_metric=True,
    return_cvbooster=True)

test.bin.zip

Unzip the file to test.bin

jmoralez commented 2 months ago

Did you inspect the produced trees?

arumds commented 2 months ago

Do you mean to get the model from lgb.train after lgb.cv and inspect the trees? If so, yes, there seems to be only the root.

The hyperparameters returned from lgb.cv() and BayesianOptimization are:

`{'num_iterations': 500, 'early_stopping_rounds': 50, 'bagging_freq': 1, 'learning_rate': 0.01, 'verbosity': -1, 'monotone_constraints': [0, 0, 0, -1, 0, 1], 'objective': <function median_loss at 0x3126261f0>, 'bagging_fraction': 0.8646440511781974, 'feature_fraction': 0.9145568099117258, 'lambda_l1': 0.006027633760716439, 'lambda_l2': 0.005448831829968969, 'max_depth': 14, 'min_child_weight': 0.6394705825246829, 'min_data_in_leaf': 16, 'min_gain_to_split': 0.045670920031283195, 'num_leaves': 292}`

The model trained with these hyperparameters yields:

lgb.Booster.trees_to_dataframe(model)
Out[5]: 
   tree_index  node_depth node_index left_child right_child parent_index  \
0           0           1       0-L0       None        None         None   
  split_feature split_gain threshold decision_type missing_direction  \
0          None       None      None          None              None   
  missing_type  value weight count  
0         None      0   None  None  

Does this indicate that the median_loss objective is not good for the dataset?

jmoralez commented 2 months ago

That means LightGBM isn't able to find a split that satisfies the constraints you've set (min_gain_to_split, min_data_in_leaf, min_child_weight, etc.).

This doesn't seem to be an issue within LightGBM or your custom loss; I'm pretty sure you'd get the same result if you used the built-in loss (a single tree with only the root, which predicts the init score).

If you have very few samples, you could try getting more data or relaxing the constraints (for example, if 16 is the minimum min_data_in_leaf your search allows).
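
For illustration, a hypothetical relaxed configuration (the values below are made up, not tuned, and assume the median_loss and dtrain from earlier in the thread) to test whether the split constraints are what keeps the trees at the root:

relaxed_params = {
    "objective": median_loss,
    "monotone_constraints": [0, 0, 0, -1, 0, 1],
    "num_leaves": 31,
    "min_data_in_leaf": 5,      # well below the 16 from the tuned params
    "min_gain_to_split": 0.0,   # accept any split with positive gain
    "min_child_weight": 1e-3,
    "verbosity": -1,
}
cv_relaxed = lgb.cv(relaxed_params, dtrain, num_boost_round=10, nfold=2, stratified=False)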

arumds commented 2 months ago

@jmoralez The hyperparameter boundaries for tuning are shown below:

hyperparam_boundaries = {
    'num_leaves': (100, 300),
    'max_depth': (10, 20),
    'feature_fraction': (0.7, 1),
    'bagging_fraction': (0.7, 1),
    'min_data_in_leaf': (10, 25),
    'min_gain_to_split': (0.01, 0.05),
    'lambda_l1': (0, 0.01),
    'lambda_l2': (0, 0.01),
}

With the built-in regression objective, Bayesian hyperparameter tuning with lgb.cv() cross-validation gives the following best hyperparameters:

{'num_iterations': 500, 'early_stopping_rounds': 50, 'bagging_freq': 1, 'learning_rate': 0.01, 'verbosity': -1, 'monotone_constraints': [0, 0, 0, -1, 0, 1], 'objective': 'regression', 'bagging_fraction': 0.8150324556477333, 'feature_fraction': 0.9375175114247993, 'lambda_l1': 0.005288949197529045, 'lambda_l2': 0.0056804456109393235, 'max_depth': 19, 'min_child_weight': 0.07041859401008829, 'min_data_in_leaf': 11, 'min_gain_to_split': 0.010808735897613029, 'num_leaves': 266}

And there is more than one tree:

lgb.Booster.trees_to_dataframe(model)
Out[2]: 
        tree_index  node_depth node_index  ...     value   weight  count
0                0           1       0-S0  ...  4.607710      0.0  66367
1                0           2       0-S2  ...  4.615160  29156.0  29156
2                0           3       0-S7  ...  4.616940  17398.0  17398
3                0           4      0-S18  ...  4.618880   2726.0   2726
4                0           5      0-S53  ...  4.621150    455.0    455
...            ...         ...        ...  ...       ...      ...    ...
265495         499          10   499-L241  ... -0.000076     20.0     20
265496         499          10   499-L256  ...  0.000423     11.0     11
265497         499           7   499-S254  ... -0.000418     25.0     25
265498         499           8    499-L38  ... -0.000174     12.0     12
265499         499           8   499-L255  ... -0.000677     13.0     13

The issue occurs only when using the custom loss function, where it cannot find a split and only predicts the init score of 0.

arumds commented 2 months ago

@jmoralez is there anything I am missing here?

jmoralez commented 2 months ago

What are you returning as the trial's score? As I said, when using a custom objective, LightGBM starts boosting from zero, which might hurt the convergence.

Can you try the approach in https://github.com/microsoft/LightGBM/issues/5114#issuecomment-1084994020 by setting the init score in your dataset (to the target's median in this case), adding it back to your predictions, and then computing your metric on that? If you're using a built-in metric it won't work, because it won't take the init scores into account.
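
A minimal sketch of that idea (not from the linked comment; names and values are illustrative), assuming the X, y and median_loss from the earlier example. It is shown with lgb.train for brevity; inside a CV/tuning loop the same offset would be added to each fold's predictions before computing the trial's score:

import lightgbm as lgb
import numpy as np

y_median = np.median(y)
# start boosting from the target's median instead of 0
dtrain_offset = lgb.Dataset(X, y, init_score=np.full(len(y), y_median))

booster = lgb.train(
    {"objective": median_loss, "verbosity": -1},
    dtrain_offset,
    num_boost_round=100,
)

# Booster.predict() does not add the Dataset's init_score, so add it back
# before computing the score returned to the hyperparameter search
preds = booster.predict(X) + y_median
rmse = np.sqrt(np.mean((preds - y) ** 2))
print(rmse)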

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!