microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Allow early stopping in Sklearn Pipeline that has a custom transformer #5090

Open rosscleung opened 2 years ago

rosscleung commented 2 years ago

Summary

LightGBM's sklearn API classifier, LGBMClassifier, accepts early_stopping_rounds, eval_metric, and eval_set as arguments to its LGBMClassifier.fit() method. That is convenient on its own, but it doesn't play well with a custom data transformer inside a Pipeline combined with sklearn's GridSearchCV or RandomizedSearchCV. Example:

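import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# custom_data_transformer is a user-defined sklearn transformer (not shown)
ml_pipeline = Pipeline(steps=[
    ('cdf',custom_data_transformer()),
    ('lgb',LGBMClassifier())])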

# You can't throw in lgb__early_stopping_rounds here because that parameter
# is used during the .fit() method, not the instantiation of the LGBMClassifier()
params = {'lgb__max_depth':np.arange(3,10),
          'lgb__reg_alpha':np.linspace(0,1,num=11),
         }

rgs = RandomizedSearchCV(estimator=ml_pipeline,
                         param_distributions=params,
                         n_iter=10,
                         cv=5)

# So we pass lgb__early_stopping_rounds to the RandomizedSearchCV
# .fit() method. But our eval_set will not have gone through
# custom_data_transformer(), so the model trains on transformed x_train
# but is evaluated on raw x_test.
rgs.fit(x_train,y_train,
        lgb__early_stopping_rounds=10,
        lgb__eval_set=[(x_test,y_test)],
        lgb__eval_metric='auc')
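
To make the mismatch concrete, here is a minimal sketch that runs without LightGBM. RecordingClassifier is a hypothetical stand-in for LGBMClassifier that only records what it receives, and the FunctionTransformer is a stand-in for custom_data_transformer(); neither is part of LightGBM. It relies on the default Pipeline behaviour of forwarding `step__param` fit arguments directly to the named step.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for LGBMClassifier: records its inputs, learns nothing."""

    def fit(self, X, y, eval_set=None):
        self.classes_ = np.unique(y)
        # X has already been transformed by the earlier pipeline steps.
        self.train_mean_ = float(np.mean(X))
        # eval_set arrives exactly as the caller passed it to Pipeline.fit().
        self.eval_mean_ = float(np.mean(eval_set[0][0]))
        return self

    def predict(self, X):
        return np.full(len(X), self.classes_[0])


# Toy data: every feature value is 1.0 in both sets.
x_train = np.ones((6, 2))
y_train = np.array([0, 1, 0, 1, 0, 1])
x_test = np.ones((4, 2))
y_test = np.array([0, 1, 0, 1])

pipe = Pipeline(steps=[
    ("cdf", FunctionTransformer(lambda X: X * 100.0)),  # stand-in transformer
    ("lgb", RecordingClassifier()),
])
pipe.fit(x_train, y_train, lgb__eval_set=[(x_test, y_test)])

recorder = pipe.named_steps["lgb"]
train_mean, eval_mean = recorder.train_mean_, recorder.eval_mean_
print(train_mean, eval_mean)
```

The training data seen by the final step has been scaled by the transformer, while the eval_set is still on the original scale, which is the same situation as x_train and x_test above.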

Motivation

LightGBM works very well on its own, but because early stopping and eval_set are set at fit() time, the eval_set bypasses the Pipeline's transformers, so early stopping isn't compatible with scikit-learn's Pipeline.
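
One possible workaround, sketched here under assumptions rather than as anything LightGBM provides: wrap the final estimator so that it carves its eval_set out of the data it receives in fit(), which has already passed through the transformers. HoldoutEvalSetClassifier, eval_fraction, and EvalSetRecorder are hypothetical names introduced for illustration; EvalSetRecorder is only a stand-in for LGBMClassifier so the sketch runs without LightGBM.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class HoldoutEvalSetClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: builds eval_set from already-transformed data."""

    def __init__(self, estimator, eval_fraction=0.2, random_state=0):
        self.estimator = estimator
        self.eval_fraction = eval_fraction
        self.random_state = random_state

    def fit(self, X, y, **fit_params):
        # X was produced by the earlier pipeline steps, so both halves of
        # this split live in the same feature space.
        X_fit, X_eval, y_fit, y_eval = train_test_split(
            X, y, test_size=self.eval_fraction,
            stratify=y, random_state=self.random_state)
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X_fit, y_fit,
                            eval_set=[(X_eval, y_eval)], **fit_params)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)


class EvalSetRecorder(ClassifierMixin, BaseEstimator):
    """Stand-in for LGBMClassifier: records the eval_set it was given."""

    def fit(self, X, y, eval_set=None):
        self.classes_ = np.unique(y)
        self.eval_min_ = float(np.min(eval_set[0][0]))
        self.n_eval_ = len(eval_set[0][1])
        return self

    def predict(self, X):
        return np.full(len(X), self.classes_[0])


X = np.ones((10, 2))
y = np.array([0, 1] * 5)

pipe = Pipeline(steps=[
    ("cdf", FunctionTransformer(lambda X: X * 100.0)),  # stand-in transformer
    ("lgb", HoldoutEvalSetClassifier(EvalSetRecorder())),
])
pipe.fit(X, y)

inner = pipe.named_steps["lgb"].estimator_
eval_min, n_eval = inner.eval_min_, inner.n_eval_
print(eval_min, n_eval)
```

With LightGBM, the wrapped estimator would presumably be LGBMClassifier(), the search parameters would be addressed as lgb__estimator__max_depth and so on, and early stopping could be requested through fit arguments such as lgb__eval_metric='auc' and lgb__callbacks=[lightgbm.early_stopping(stopping_rounds=10)], which the wrapper forwards unchanged. The trade-off is that each fit trains on less data, since part of every training fold is held out for evaluation.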

Description

If LightGBM's sklearn API played well with sklearn's Pipeline API, it would encourage more adoption!

References

jmoralez commented 2 years ago

Linking #3313.