microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python-package] How to pass additional attributes to custom eval or loss functions in regression task? #6516

Open ComeTr-2097 opened 3 months ago

ComeTr-2097 commented 3 months ago

Hello Team,

Great work on the package. I know there is excellent support for custom eval and loss functions. However, is there a way to pass additional attributes (not part of the feature set)? For example, suppose there are three datasets: X, y, and extra (containing the additional attributes), whose samples are in one-to-one correspondence. I would like to create custom eval or loss functions that use y_true, y_pred, and the additional attributes. Here is the code:

import lightgbm as lgb
import numpy as np
import pandas as pd

# Create custom loss function
def custom_loss(y_true, y_pred):
    LE_true = y_true[:, 1]
    numerator = y_true[:, 2]
    DELTA = y_true[:, 3]
    gamma = y_true[:, 4]
    ra = y_true[:, 5]
    rs = y_pred
    # cp=1013
    LE_pred = numerator / (DELTA + gamma * (1 + rs / ra))

    # loss = np.mean(np.square(LE_true - LE_pred))
    grad = -2 * (LE_true - LE_pred)
    hess = np.full_like(grad, 2)
    return grad, hess

# Create custom eval function
def custom_eval(y_true, y_pred):
    LE_true = y_true[:, 1]
    numerator = y_true[:, 2]
    DELTA = y_true[:, 3]
    gamma = y_true[:, 4]
    ra = y_true[:, 5]
    rs = y_pred
    # cp=1013
    LE_pred = numerator / (DELTA + gamma * (1 + rs / ra))

    loss = np.mean(np.square(LE_true - LE_pred))
    return 'custom_mse', loss, False

# Concat y_train and extra_train
ye_train = pd.concat([y_train, extra_train], axis=1)
# Concat y_val and extra_val
ye_val = pd.concat([y_val, extra_val], axis=1)

train_data = lgb.Dataset(X_train, label=ye_train)
val_data = lgb.Dataset(X_val, label=ye_val, reference=train_data)

I am trying to follow the approach from "Custom loss function in Keras based on the input data" by combining y with extra into a final label (ye in the code above). However, the label parameter of lightgbm.Dataset must be 1-D, so the code above does not work. I have also found two similar questions (https://github.com/microsoft/LightGBM/issues/4009; https://github.com/microsoft/LightGBM/issues/1292). So the question is: how can I pass additional attributes to custom eval or loss functions in a LightGBM regression task? Thanks a lot.

Maybe:

# Create custom loss function
def custom_loss(y_true, y_pred, extra):
    ...

# Create custom eval function
def custom_eval(y_true, y_pred, extra):
    ...
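
In the meantime, perhaps a closure could bind the extra data into the signature LightGBM expects. A rough sketch with a hypothetical weighted-MSE objective (make_weighted_mse is my own helper, not a LightGBM API):

import numpy as np
import lightgbm as lgb

def make_weighted_mse(weights):
    # Bind per-sample side information (here: weights) via a closure.
    # `weights` must be a 1-D array aligned row-for-row with the
    # training data.
    def weighted_mse_objective(y_pred, train_data):
        y_true = train_data.get_label()
        grad = 2.0 * weights * (y_pred - y_true)
        hess = 2.0 * weights
        return grad, hess
    return weighted_mse_objective

# usage sketch:
# params = {'objective': make_weighted_mse(extra_train), 'metric': 'None'}
# gbm = lgb.train(params, train_data)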

Best, Chen

jmoralez commented 3 months ago

Hey. I believe there's not a way to do that currently, but maybe the suggestion in https://github.com/microsoft/LightGBM/issues/4995#issuecomment-1033009439 could help in the meantime.
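
For reference, a minimal sketch of that workaround (the attribute name extra and the weighted metric are placeholders):

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.rand(200)
extra = np.random.rand(200)  # per-row side information, not a feature

train_data = lgb.Dataset(X, label=y)
train_data.extra = extra  # plain Python attribute on the Dataset object

def custom_eval(y_pred, eval_data):
    # the attribute attached above is available here
    w = eval_data.extra
    loss = float(np.mean(w * (eval_data.get_label() - y_pred) ** 2))
    return 'weighted_mse', loss, False  # (name, value, is_higher_better)

gbm = lgb.train({'objective': 'regression', 'metric': 'None', 'verbose': -1},
                train_data, num_boost_round=10,
                valid_sets=[train_data], feval=custom_eval)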

jameslamb commented 3 months ago

+1 to @jmoralez's suggestion; adding an extra attribute to the Dataset is one way to do that.

Other options I can think of:

ComeTr-2097 commented 3 months ago

@jameslamb @jmoralez Thank you for your timely suggestions. I will try the methods above and update my code later.

ComeTr-2097 commented 3 months ago

@jameslamb @jmoralez I believe that adding an extra attribute to the Dataset is one way to do that. However, I wonder whether 'id' in https://github.com/microsoft/LightGBM/issues/4995 (comment) is one of the features (like 'X') used to construct the LightGBM model, because my additional attributes have no relationship with the target variable 'y' in my regression task: they help the custom eval or loss functions rather than act as features for predicting 'y'. Looking forward to your reply.

ds.id = id # this is the extra attribute

jameslamb commented 3 months ago

You can follow that example to put any attribute you want on the Dataset. It can be any Python object, of any type or shape.
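
For example (a sketch; side_info is an arbitrary attribute name):

import numpy as np
import lightgbm as lgb

ds = lgb.Dataset(np.random.rand(50, 3), label=np.random.rand(50))
ds.side_info = {"gamma": 0.066, "ra": np.ones(50)}  # any Python object
# attributes like this live only on the Python wrapper object;
# they are never used as features during training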

ComeTr-2097 commented 3 months ago

@jameslamb @jmoralez Thanks for your help! I have revised my code following the example in https://github.com/microsoft/LightGBM/issues/4995 (comment). Although it runs, the loss does not decrease. Here is my code:

import numpy as np
import pandas as pd
import random
import Hybrid_f2
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Set seed
np.random.seed(42)
random.seed(42)

# Custom loss function
def custom_loss(y_pred, train_data):
    e_train_data = train_data.e_train
    LE_true = e_train_data[:, 0]
    numerator = e_train_data[:, 1]
    DELTA = e_train_data[:, 2]
    gamma = e_train_data[:, 3]
    ra = e_train_data[:, 4]
    rs = y_pred
    # cp=1013
    LE_pred = numerator / (DELTA + gamma * (1 + rs / ra))

    # # Calculate loss
    # loss = np.mean(np.square(LE_true - LE_pred))
    grad = -2 * (LE_true - LE_pred)
    hess = np.full_like(grad, 2)
    return grad, hess

# Custom eval function
def custom_eval(y_pred, train_data):
    e_train_data = train_data.e_train
    LE_true = e_train_data[:, 0]
    numerator = e_train_data[:, 1]
    DELTA = e_train_data[:, 2]
    gamma = e_train_data[:, 3]
    ra = e_train_data[:, 4]
    rs = y_pred
    # cp=1013
    LE_pred = numerator / (DELTA + gamma * (1 + rs / ra))

    # Calculate loss
    loss = np.mean(np.square(LE_true - LE_pred))
    return 'custom_mse', loss, False

# Prepare dataset
external = pd.DataFrame()
external['es'] = Hybrid_f2.es_calc(EC_total['TA_Avg'].values)/100
external['ea'] = Hybrid_f2.ea_calc(EC_total['TA_Avg'].values,
                                   EC_total['RH_Avg'].values)/100
external['DELTA'] = Hybrid_f2.Delta_calc(EC_total['TA_Avg'].values)/100
external['gamma'] = Hybrid_f2.gamma_calc(EC_total['P_Avg'].values*100)/100
external['rho'] = Hybrid_f2.rho_calc(EC_total['TA_Avg'].values,
                                     EC_total['P_Avg'].values*100)
external['Rn'] = EC_total['Rn_Avg']
external['G'] = EC_total['G_Avg']
external['ra'] = EC_total['ra']

external['DELTA*(Rn-G)+rho*1013*(es-ea)/ra'] = (
    external['DELTA'] * (external['Rn'] - external['G'])
    + external['rho'] * 1013 * (external['es'] - external['ea']) / external['ra']
)

external = external[['DELTA*(Rn-G)+rho*1013*(es-ea)/ra','DELTA','gamma','ra']]
external['LE'] = EC_total['LE']
external = external[['LE','DELTA*(Rn-G)+rho*1013*(es-ea)/ra','DELTA','gamma','ra']]

# Split dataset to train, test, val
X_train, X_vt, y_train, y_vt, e_train, e_vt = train_test_split(X2_scaled, y2, external, test_size=0.3, random_state=42, shuffle=True)
X_val, X_test, y_val, y_test, e_val, e_test = train_test_split(X_vt, y_vt, e_vt, test_size=1/3, random_state=42, shuffle=True)

# Prepare train_data
X_train = X_train.values
y_train = y_train.values
e_train = e_train.values

# Prepare val_data
X_val = X_val.values
y_val = y_val.values
e_val = e_val.values

# Prepare test_data
X_test = X_test.values
y_test = y_test.values
e_test = e_test.values

# Convert to lgb.Dataset
train_data = lgb.Dataset(X_train, label=y_train)
train_data.e_train = e_train

val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
val_data.e_val = e_val

# Define params
params = {
    'objective': custom_loss,
    'boosting': 'gbdt',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'metric': 'None'
}
# Set callback functions
log_evaluation = lgb.log_evaluation(1)

# Fit model
gbm = lgb.train(params = params,
                train_set = train_data,
                num_boost_round = 200,
                valid_sets = [train_data], # val_data
                feval = custom_eval,
                callbacks = [log_evaluation])

# Save model
gbm.save_model('custom_model.txt')

My output when evaluating on train_data:

[1] training's custom_mse: 6.23892e+06
[2] training's custom_mse: 2.15529e+07
[3] training's custom_mse: 2.16454e+09
[4] training's custom_mse: 1.18045e+07
[5] training's custom_mse: 6.7505e+06
[6] training's custom_mse: 4.65361e+07
[7] training's custom_mse: 1.65079e+07
[8] training's custom_mse: 6.09358e+06
[9] training's custom_mse: 3.71192e+06
[10]    training's custom_mse: 3.14332e+06
......
[190]   training's custom_mse: 1.02903e+06
[191]   training's custom_mse: 7.857e+09
[192]   training's custom_mse: 2.65892e+06
[193]   training's custom_mse: 2.38379e+09
[194]   training's custom_mse: 3.8954e+06
[195]   training's custom_mse: 498642
[196]   training's custom_mse: 498837
[197]   training's custom_mse: 870972
[198]   training's custom_mse: 219080
[199]   training's custom_mse: 1.1609e+06
[200]   training's custom_mse: 141097

My output when evaluating on val_data:

[1] valid_0's custom_mse: 4.90315e+06
[2] valid_0's custom_mse: 1.90869e+07
[3] valid_0's custom_mse: 7.65195e+07
[4] valid_0's custom_mse: 1.29522e+07
[5] valid_0's custom_mse: 1.11674e+06
[6] valid_0's custom_mse: 3.79441e+06
[7] valid_0's custom_mse: 7.66696e+07
[8] valid_0's custom_mse: 2.28049e+06
[9] valid_0's custom_mse: 832201
[10]    valid_0's custom_mse: 1.43712e+06
......
[190]   valid_0's custom_mse: 75458.3
[191]   valid_0's custom_mse: 57434.1
[192]   valid_0's custom_mse: 151523
[193]   valid_0's custom_mse: 597265
[194]   valid_0's custom_mse: 449191
[195]   valid_0's custom_mse: 146445
[196]   valid_0's custom_mse: 1.75591e+06
[197]   valid_0's custom_mse: 288885
[198]   valid_0's custom_mse: 751486
[199]   valid_0's custom_mse: 28503.6
[200]   valid_0's custom_mse: 660888

jameslamb commented 3 months ago

Thanks for that, could you please reduce and simplify this code, and try to make it reproducible? It's quite a lot of code to look through, and objects like EC_total, X2_scaled, and others are not defined. As another example... your code is generating validation and test sets and then not using them... all of that could probably be removed.

If you can narrow this down to an example that:

We'd be happy to investigate further.

ComeTr-2097 commented 3 months ago

Thanks a lot. I am trying to create an example that is as minimal as possible and will give more details. I will update my code later.
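
A first pass at that minimal example might look like the following (synthetic data stands in for EC_total and Hybrid_f2; the structure mirrors my real code):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for the real meteorological attributes
X = rng.random((n, 5))
numerator = rng.uniform(100, 500, n)
DELTA = rng.uniform(0.5, 2.0, n)
gamma = rng.uniform(0.05, 0.08, n)
ra = rng.uniform(10, 100, n)
rs_true = rng.uniform(10, 200, n)
LE_true = numerator / (DELTA + gamma * (1 + rs_true / ra))
extra = np.column_stack([LE_true, numerator, DELTA, gamma, ra])

def custom_loss(y_pred, train_data):
    e = train_data.extra
    LE_pred = e[:, 1] / (e[:, 2] + e[:, 3] * (1 + y_pred / e[:, 4]))
    grad = -2 * (e[:, 0] - LE_pred)
    hess = np.full_like(grad, 2)
    return grad, hess

def custom_eval(y_pred, eval_data):
    e = eval_data.extra
    LE_pred = e[:, 1] / (e[:, 2] + e[:, 3] * (1 + y_pred / e[:, 4]))
    return 'custom_mse', float(np.mean((e[:, 0] - LE_pred) ** 2)), False

train_data = lgb.Dataset(X, label=rs_true)
train_data.extra = extra

params = {'objective': custom_loss, 'learning_rate': 0.1,
          'num_leaves': 31, 'metric': 'None'}
gbm = lgb.train(params, train_data, num_boost_round=50,
                valid_sets=[train_data], feval=custom_eval,
                callbacks=[lgb.log_evaluation(10)])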