microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
MIT License

[Python] Using early_stopping_rounds with GridSearchCV / GroupKFold #1044

Closed mandeldm closed 6 years ago

mandeldm commented 7 years ago

I'm using LGBMRegressor with sklearn.model_selection.GridSearchCV, with cross-validation split based on sklearn.model_selection.GroupKFold. When I include early_stopping_rounds=5 in the estimator, I get the following error:

ValueError: For early stopping, at least one dataset and eval metric is required for evaluation

Without the early_stopping_rounds argument the code runs fine.

I could be wrong, but it seems that LGBMRegressor does not treat the cv argument in GridSearchCV and the groups argument as a legitimate eval dataset.

Is there a proper way to use early_stopping_rounds with GridSearchCV / GroupKFold?

Thank you.

wxchan commented 7 years ago

maybe you can use GroupKFold + ParameterGrid to simulate GridSearchCV.

btw, could you give a simple runnable case we can debug on? we might try to make GridSearchCV work. thanks

mandeldm commented 7 years ago

Here is a simple version of the code that runs fine. Uncommenting early_stopping_rounds=5 breaks it for me.

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, GroupKFold


# Build 2 categorical features, 3 float ones & an error term:
categorical_df = pd.DataFrame(np.random.randint(2, size=(1000, 2)), columns=['cat01', 'cat02']).astype(int)
float_df = pd.DataFrame(np.random.rand(1000, 4), columns=['aa', 'bb', 'cc', 'error'])

# Build a group feature for GroupKFold (k=4):
kfold_df = pd.DataFrame(np.random.randint(4, size=(1000, )), columns=['grp'])

# Assemble features into a df.
df = pd.concat([categorical_df, float_df, kfold_df], axis=1)

# Create a dependent variable.
df['yy'] = \
df['aa'] + 2 * df['bb'] + \
df['cc'] * df['cat01'] + \
df['error'] * (1 + df['cat02'])

# Identify our X, y & categorical columns.
Xcols = ['cat01', 'cat02', 'aa', 'bb', 'cc']
y_col = 'yy'
categoricals = ['cat01', 'cat02']

# Set up and run LGBM with GridSearchCV based on GroupKFold.
gkf = GroupKFold(n_splits=4).split(X=df[Xcols], groups=df['grp'])

param_grid = {
    'num_leaves': [31, 127],
    'feature_fraction': [0.5, 1.0],
    'bagging_fraction': [0.75, 0.95], 
    'reg_alpha': [0.1, 0.5]}

lgb_estimator = lgb.LGBMRegressor(boosting_type='gbdt',
                                  categorical_feature=[Xcols.index(col) for col in categoricals])#,
#                                   early_stopping_rounds=5) # REMOVING THIS ARGUMENT MAKES THE CODE RUN OKAY

gsearch = GridSearchCV(estimator=lgb_estimator, param_grid=param_grid, cv=gkf)

lgb_model = gsearch.fit(X=df[Xcols], y=df[y_col])

print(lgb_model.best_params_, lgb_model.best_score_)

wxchan commented 7 years ago

I checked the GridSearchCV code: its logic is train and test. For early stopping we need a validation set during training, and it should not be the test set.

Apart from this, early_stopping_rounds should be passed to the fit function, like lgb_model = gsearch.fit(X=df[Xcols], y=df[y_col], eval_set=(df[Xcols], df[y_col]), early_stopping_rounds=5), though it may not be what you want.

mandeldm commented 7 years ago

When I add eval_set and early_stopping_rounds to fit function as you did, I get: TypeError: fit() got an unexpected keyword argument 'eval_set'

I'm using the following versions: Python 3.5.3 (WinPython) on Windows 10, Jupyter 4.3.0, np 1.13.3, pd 0.20.3, lgb 2.0.10, sklearn 0.18.1.

UPDATE: I did not realize that GridSearchCV.fit began supporting kwargs in sklearn 0.19. Thank you, I will update my package.

But that does not really solve the original problem, since evaluating on the training set is not a great idea. Is my best option to write nested loops over both the validation folds and the parameter grid?


wxchan commented 7 years ago

you can try our native api

julioasotodv commented 7 years ago

I think that it is simpler than your last comment, @mandeldm.

As @wxchan said, the native API's lightgbm.cv performs K-fold cross-validation for an LGBM model and allows early stopping.

At the end of the day, sklearn's GridSearchCV just does that (performing K-fold) plus turning your hyperparameter grid into an iterable with all possible hyperparameter combinations. This means that you could just use lightgbm.cv for hyperparameter optimization, with early stopping embedded in each experiment (each hyperparameter combination). A very naïve (but correct) way to do so would be something like:

import lightgbm
from sklearn.model_selection import train_test_split

# Whatever dataset you have: X (features) and y (labels).
# Hold out a test set for the final evaluation; it is not used in the search below.
X_train, X_test, y_train, y_test = train_test_split(X, y)
train_set = lightgbm.Dataset(X_train, label=y_train)

# Imagine now that you want to optimize num_leaves and
# learning_rate, and also use early stopping:
num_leaves_choices = [56, 128, 256]
learning_rate_choices = [0.05, 0.1, 0.2]

# We will store the cross validation results in a simple list,
# with tuples in the form of (hyperparam dict, cv score):
cv_results = []

for num_lv in num_leaves_choices:
    for lr in learning_rate_choices:
        hyperparams = {"objective": "binary",  # or whatever fits your task
                       "metric": "auc",  # assumed, to match the "auc-mean" key below
                       "num_leaves": num_lv,
                       "learning_rate": lr,
                       # Other constant hyperparameters
                       }
        # Newer LightGBM releases replace early_stopping_rounds with
        # callbacks=[lightgbm.early_stopping(50)] and name the result key "valid auc-mean".
        validation_summary = lightgbm.cv(hyperparams,
                                         train_set,
                                         num_boost_round=4096,  # any high number will do
                                         early_stopping_rounds=50)  # Here it is
        optimal_num_trees = len(validation_summary["auc-mean"])
        # Let's just add the optimal number of trees (chosen by early stopping)
        # to the hyperparameter dictionary:
        hyperparams["optimal_number_of_trees"] = optimal_num_trees

        # And we append results to cv_results:
        cv_results.append((hyperparams, validation_summary["auc-mean"][-1]))

Obviously, nested for loops are a very basic approach. Fortunately, scikit-learn's ParameterGrid can build an iterator for you :)
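As a rough illustration of that last point, ParameterGrid expands a dict of candidate lists into every combination, so a single loop can replace the nested ones. The `base_params` dict and the `all_hyperparams` name are assumptions made for this sketch.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the nested-loop example.
grid = ParameterGrid({
    "num_leaves": [56, 128, 256],
    "learning_rate": [0.05, 0.1, 0.2],
})

# Constant hyperparameters shared by every experiment (assumed values).
base_params = {"objective": "binary", "metric": "auc"}

# One flat list of hyperparameter dicts, one per combination.
all_hyperparams = [dict(base_params, **combo) for combo in grid]

for hyperparams in all_hyperparams:
    # Each dict can be handed to the cross-validation call in place of
    # the dict built inside the nested loops.
    print(hyperparams)
```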

guolinke commented 6 years ago

@wxchan Is this solved ?

Interstella12 commented 6 years ago

I received the same error, "ValueError: For early stopping, at least one dataset and eval metric is required for evaluation". My code is like this; obviously, I've set 'eval_set' and 'eval_metric':

lgbc_params = {
    'boosting_type': "gbdt",  # string, optional (default="gbdt")
    'num_leaves': 31,  # int, optional (default=31)
    'max_depth': -1,  # int, optional (default=-1)
    'objective': 'binary',  # string, callable or None, optional (default=None)
    'random_state': 100,  # int or None, optional (default=None)
    'n_jobs': -1,  # optional (default=-1)
    'silent': 0,  # bool, optional (default=True)
}
lgbc_fit_params = {
    'eval_set': [(x_eval, y_eval)],  # list or None, optional (default=None)
    # A list of (X, y) tuple pairs to use as a validation sets for early-stopping.
    'eval_names': ['evalset'],  # list of strings or None, optional (default=None)
    'eval_metric': ['f1'],  # string, list of strings, callable or None, optional (default=None)
    'early_stopping_rounds': 10,  # int or None, optional (default=None)
}
n_lr = np.linspace(0.01, 0.2, 19)
param_grid = {
    'learning_rate': n_lr,
}

lgbc = LGBMClassifier(**lgbc_params)
gs_lgbc = model_selection.GridSearchCV(lgbc, return_train_score=True, param_grid=param_grid, cv=5, refit=True, n_jobs=1, verbose=0, scoring='f1')
# x_train: the training features (name assumed; it was lost from the original post)
gs_lgbc.fit(x_train, y=y_train, **lgbc_fit_params)

guolinke commented 6 years ago

@Interstella12 I think f1 is not a built-in eval_metric in LightGBM. You can set it to a supported one.
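If F1 is really the metric you want to stop on, one option is a custom metric: the scikit-learn API accepts a callable eval_metric that takes (y_true, y_pred) and returns (eval_name, eval_result, is_higher_better). The sketch below is only an illustration; the name `lgb_f1` and the 0.5 threshold are assumptions, and it presumes a built-in binary objective so that y_pred holds positive-class probabilities.

```python
import numpy as np
from sklearn.metrics import f1_score


def lgb_f1(y_true, y_pred):
    """Custom eval metric returning (name, value, is_higher_better)."""
    # Assumption: y_pred contains probabilities of the positive class.
    y_hat = (np.asarray(y_pred) > 0.5).astype(int)
    return "f1", f1_score(y_true, y_hat), True


# Quick check on toy values: predictions become [0, 1, 0, 0].
name, value, higher_is_better = lgb_f1(
    np.array([0, 1, 1, 0]), np.array([0.2, 0.9, 0.4, 0.1])
)
print(name, value, higher_is_better)
```

It would then be passed as eval_metric=lgb_f1 in the fit parameters instead of the string 'f1'.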

bhaskar-c commented 6 years ago


Given your code example above, where you loop through each parameter, I see that you have divided the data into train and test set. But then you don't seem to have used the test data from the split anywhere for validation.

I think there is something missing in the code or I am missing something very obvious. Can you explain what purpose the train-test split serves in that case?

julioasotodv commented 6 years ago

@quakig Indeed, I am not using the test set in the example.

In supervised ML, it is good practice to keep a separate hold-out test dataset purely for final metrics, once you have selected your best model and hyperparameters. This means that I first separate a test set, then use lightgbm.cv to perform K-fold cross-validation over different hyperparameter combinations on the train (+ validation) set, and once those are chosen, see how well the model performs on the test set. *4G__SV580CxFj78o9yUXuQ.png shows an example (the first diagram).

louisabraham commented 6 years ago

I have a similar problem that is not solved. I want to do a grid search on a whole pipeline, so I cannot use cv.

IMO, it would be much better to have a parameter validation_fraction like GradientBoostingClassifier.

How is it done in cv? I suppose that it trains on k-2 folds, and uses one fold for validation and one for test. I don't think it would be particularly horrible to use a random fold for the validation.

seahrh commented 5 years ago

I was able to use GridSearchCV with early stopping rounds. Working example:

model = LGBMClassifier(...)
cv = GridSearchCV(pipe, cv=ps, param_grid=param_grid, scoring='roc_auc')
# x_train: the training features (name assumed; it was lost from the original post)
cv.fit(x_train, y_train,
  model__eval_set=[(x_val, y_val)], model__eval_metric='auc', model__early_stopping_rounds=200)

lbo34 commented 5 years ago

@seahrh : Really? it worked this way? I am getting "fit() got an unexpected keyword argument 'model__eval_set'"
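One likely cause (an assumption, since the failing code is not shown): the `model__` prefix is only understood by a scikit-learn Pipeline that has a step named "model". Passed to a bare estimator, the prefixed name is just an unknown keyword. The sketch below illustrates the routing with LinearRegression and sample_weight as stand-ins for LGBMClassifier and eval_set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and weights.
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = X[:, 0] - X[:, 1]
weights = np.ones(50)

# Inside a Pipeline, "<step name>__<param>" is routed to that step's fit().
pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y, model__sample_weight=weights)

# On a bare estimator the same keyword is rejected.
try:
    LinearRegression().fit(X, y, model__sample_weight=weights)
    bare_error = None
except TypeError as exc:
    bare_error = str(exc)

print(bare_error)
```

So the prefix has to match the step name used when the pipeline was built, and it should be dropped entirely when the estimator is not wrapped in a pipeline.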

suissemaxx commented 5 years ago

This example works for me:

gbm = lgb.LGBMRegressor(n_jobs=-1)

param_grid = {
    "num_leaves" : np.linspace(10, 200, 4, dtype=np.int32),
    'learning_rate': np.linspace(0.1, 1, 5),
    'n_estimators': np.linspace(10, 1000, 5, dtype=np.int32),
    'early_stopping_rounds' : [20],
}

gbm = GridSearchCV(gbm, param_grid, cv=3, scoring="neg_mean_squared_error", verbose=100, n_jobs=-1)
# X_train: the training features (name assumed; it was lost from the original post)
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="rmse")
print('Best parameters:', gbm.best_params_)