microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[Python] Using early_stopping_rounds with GridSearchCV / GroupKFold #1044

Closed: mandeldm closed this issue 6 years ago

mandeldm commented 7 years ago

I'm using LGBMRegressor with sklearn.model_selection.GridSearchCV, with cross-validation split based on sklearn.model_selection.GroupKFold. When I include early_stopping_rounds=5 in the estimator, I get the following error:

ValueError: For early stopping, at least one dataset and eval metric is required for evaluation

Without the early_stopping_rounds argument the code runs fine.

I could be wrong, but it seems that LGBMRegressor does not treat the cv argument of GridSearchCV or the groups argument of GridSearchCV.fit as a legitimate eval dataset.

Is there a proper way to use early_stopping_rounds with GridSearchCV / GroupKFold?

Thank you.

wxchan commented 7 years ago

Maybe you can use GroupKFold + ParameterGrid to simulate GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html

BTW, could you give a simple runnable case we can debug on? We might try to make GridSearchCV work. Thanks.
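For illustration, a minimal sketch of that GroupKFold + ParameterGrid approach (X, y and groups are placeholders for your arrays, the parameter values are examples, and on recent LightGBM versions early stopping and verbosity are passed via callbacks instead of fit arguments). Note that here each held-out fold serves both as the early-stopping set and the scoring set, which is optimistic, as discussed later in this thread:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GroupKFold, ParameterGrid

# X, y, groups are assumed to be numpy arrays defined elsewhere.
param_grid = ParameterGrid({'num_leaves': [31, 127], 'learning_rate': [0.01, 0.1]})
gkf = GroupKFold(n_splits=4)

results = []
for params in param_grid:
    fold_scores = []
    for train_idx, valid_idx in gkf.split(X, y, groups=groups):
        model = lgb.LGBMRegressor(n_estimators=1000, **params)
        # Early-stop on the held-out fold of this split:
        model.fit(X[train_idx], y[train_idx],
                  eval_set=[(X[valid_idx], y[valid_idx])],
                  eval_metric='l1',
                  early_stopping_rounds=5,
                  verbose=False)
        fold_scores.append(model.best_score_['valid_0']['l1'])
    results.append((params, np.mean(fold_scores)))

# Lower l1 is better:
best_params, best_score = min(results, key=lambda r: r[1])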

mandeldm commented 7 years ago

Here is a simple version of the code that runs fine. Uncommenting early_stopping_rounds=5 breaks it for me.

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, GroupKFold

np.random.seed(1)

# Build 2 categorical features, 3 float ones & an error term:
categorical_df = pd.DataFrame(np.random.randint(2, size=(1000, 2)), columns=['cat01', 'cat02']).astype(int)
float_df = pd.DataFrame(np.random.rand(1000, 4), columns=['aa', 'bb', 'cc', 'error'])

# Build a group feature for GroupKFold (k=4):
kfold_df = pd.DataFrame(np.random.randint(4, size=(1000, )), columns=['grp'])

# Assemble features into a df.
df = pd.concat([categorical_df, float_df, kfold_df], axis=1)

# Create a dependent variable.
df['yy'] = \
df['aa'] + 2 * df['bb'] + \
df['cc'] * df['cat01'] + \
df['error'] * (1 + df['cat02'])

# Identify our X, y & categorical columns.
Xcols = ['cat01', 'cat02', 'aa', 'bb', 'cc']
y_col = 'yy'
categoricals = ['cat01', 'cat02']

# Set up and run LGBM with GridSearchCV based on GroupKFold.
gkf = GroupKFold(n_splits=4).split(X=df[Xcols],
                                   y=df[y_col], 
                                   groups=df['grp'])

param_grid = {
    'num_leaves': [31, 127],
    'feature_fraction': [0.5, 1.0],
    'bagging_fraction': [0.75, 0.95], 
    'reg_alpha': [0.1, 0.5]}

lgb_estimator = lgb.LGBMRegressor(boosting_type='gbdt',
                                  objective='regression',
                                  bagging_freq=5,
                                  num_boost_round=50,
                                  learning_rate=0.01,
                                  eval_metric='l1',
                                  categorical_feature=[Xcols.index(col) for col in categoricals])#,
#                                   early_stopping_rounds=5) # REMOVING THIS ARGUMENT MAKES THE CODE RUN OKAY

gsearch = GridSearchCV(estimator=lgb_estimator, 
                       param_grid=param_grid, 
                       cv=gkf) 

lgb_model = gsearch.fit(X=df[Xcols], 
                        y=df[y_col])

print(lgb_model.best_params_, lgb_model.best_score_)

wxchan commented 7 years ago

I checked the GridSearchCV code; its logic is train and test. For early stopping we need a validation set during training, and it should not be the test set.

Apart from that, early_stopping_rounds should be passed to the fit function, e.g. lgb_model = gsearch.fit(X=df[Xcols], y=df[y_col], eval_set=[(df[Xcols], df[y_col])], early_stopping_rounds=5), though it may not be what you want (it evaluates on the training data).
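Formatted, that call would look something like this (a sketch assuming scikit-learn >= 0.19, which forwards fit keyword arguments to the underlying estimator; note it early-stops on the training data itself):

gsearch = GridSearchCV(estimator=lgb_estimator,
                       param_grid=param_grid,
                       cv=gkf)

lgb_model = gsearch.fit(X=df[Xcols],
                        y=df[y_col],
                        eval_set=[(df[Xcols], df[y_col])],
                        early_stopping_rounds=5)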

mandeldm commented 7 years ago

When I add eval_set and early_stopping_rounds to the fit function as you did, I get: TypeError: fit() got an unexpected keyword argument 'eval_set'

I'm using the following versions: Python 3.5.3 (WinPython) on Windows 10, Jupyter 4.3.0, np 1.13.3, pd 0.20.3, lgb 2.0.10, sklearn 0.18.1.

UPDATE: I did not realize that GridSearchCV.fit began supporting kwargs in 0.19. Thank you, I will update my package.

But that does not really solve the original problem, since evaluating on the training set is not a great idea. Is my best option to write nested loops over both the validation folds and the parameter grid?

Thanks.

wxchan commented 7 years ago

You can try our native API, lightgbm.cv.

julioasotodv commented 7 years ago

I think it is simpler than your last comment, @mandeldm.

As @wxchan said, lightgbm.cv performs K-fold cross-validation for a LightGBM model and allows early stopping.

At the end of the day, sklearn's GridSearchCV just does that (performing K-fold CV) plus turning your hyperparameter grid into an iterable of all possible hyperparameter combinations. This means that you could just use lightgbm.cv for hyperparameter optimization, with early stopping embedded in each experiment (each hyperparameter combination). A very naïve (but correct) way to do so would be something like:

import lightgbm

# Whatever dataset you have

from sklearn.model_selection import train_test_split

train, test = train_test_split(...)  # with your dataset; note that lightgbm.cv
                                     # expects `train` to be a lightgbm.Dataset

# Imagine now that you want to optimize num_leaves and
# learning_rate, and also use early stopping:
num_leaves_choices = [56, 128, 256]
learning_rate_choices = [0.05, 0.1, 0.2]

# We will store the cross-validation results in a simple list,
# with tuples in the form of (hyperparam dict, cv score):
cv_results = []

for num_lv in num_leaves_choices:
    for lr in learning_rate_choices:
        hyperparams = {"objective": "binary",  # whatever objective fits your problem
                       "num_leaves": num_lv,
                       "learning_rate": lr,
                       # Other constant hyperparameters
                       }
        validation_summary = lightgbm.cv(hyperparams,
                                         train,
                                         num_boost_round=4096,  # any high number will do
                                         nfold=10,
                                         metrics=["auc"],
                                         early_stopping_rounds=50,  # Here it is
                                         verbose_eval=10)
        # With early stopping, the length of the metric history equals the
        # optimal number of boosting rounds:
        optimal_num_trees = len(validation_summary["auc-mean"])
        # Let's just add the optimal number of trees (chosen by early stopping)
        # to the hyperparameter dictionary:
        hyperparams["optimal_number_of_trees"] = optimal_num_trees

        # And we append results to cv_results:
        cv_results.append((hyperparams, validation_summary["auc-mean"][-1]))

Obviously, the nested for loops are a very basic approach. Fortunately, scikit-learn's ParameterGrid can build that iterator for you :)
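For example, a minimal sketch of swapping the nested loops for ParameterGrid (the values mirror the choices above):

from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({"num_leaves": [56, 128, 256],
                      "learning_rate": [0.05, 0.1, 0.2]})

for hyperparams in grid:
    # hyperparams is a plain dict such as {"learning_rate": 0.05, "num_leaves": 56};
    # merge it with the constant parameters and pass it to lightgbm.cv as above.
    print(hyperparams)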

guolinke commented 6 years ago

@wxchan Is this solved?

Interstella12 commented 6 years ago

I received the same error "ValueError: For early stopping, at least one dataset and eval metric is required for evaluation". My code is like this; obviously, I've set the 'eval_set' and 'eval_metric':

lgbc_params = {
    'boosting_type': "gbdt",   # string, optional (default="gbdt")
    'num_leaves': 31,          # int, optional (default=31)
    'max_depth': -1,           # int, optional (default=-1)
    'objective': 'binary',     # string, callable or None, optional (default=None)
    'random_state': 100,       # int or None, optional (default=None)
    'n_jobs': -1,              # optional (default=-1)
    'silent': 0,               # bool, optional (default=True)
}

lgbc_fit_params = {
    'eval_set': [(x_eval, y_eval)],   # list or None, optional (default=None);
                                      # a list of (X, y) tuple pairs to use as validation sets for early stopping
    'eval_names': ['evalset'],        # list of strings or None, optional (default=None)
    'eval_metric': ['f1'],            # string, list of strings, callable or None, optional (default=None)
    'early_stopping_rounds': 10,      # int or None, optional (default=None)
}

n_lr = np.linspace(0.01, 0.2, 19)
param_grid = {
    'learning_rate': n_lr,
}

lgbc = LGBMClassifier(**lgbc_params)
gs_lgbc = model_selection.GridSearchCV(lgbc, return_train_score=True, param_grid=param_grid,
                                       cv=5, refit=True, n_jobs=1, verbose=0, scoring='f1')
gs_lgbc.fit(X=x_train, y=y_train, **lgbc_fit_params)

guolinke commented 6 years ago

@Interstella12 I think f1 is not a built-in eval_metric in LightGBM; you can set it to a valid one.
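For example, either of the following would work (a sketch reusing the lgbc_fit_params dict above; the custom-metric route uses the sklearn-API callable signature (y_true, y_pred) -> (name, value, is_higher_better)):

from sklearn.metrics import f1_score

# Option 1: switch to a built-in LightGBM metric such as 'auc' or 'binary_logloss':
lgbc_fit_params['eval_metric'] = 'auc'

# Option 2: keep F1 by passing it as a custom metric callable; with the built-in
# binary objective, y_pred holds predicted probabilities for the positive class:
def f1_eval(y_true, y_pred):
    return 'f1', f1_score(y_true, y_pred > 0.5), True

lgbc_fit_params['eval_metric'] = f1_eval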

bhaskar-c commented 6 years ago

@julioasotodv

Given your code example above, where you loop through each parameter, I see that you have divided the data into train and test sets. But then you don't seem to use the test data from the split anywhere for validation.

I think there is something missing in the code, or I am missing something very obvious. Can you explain what purpose the train-test split serves in that case?

julioasotodv commented 6 years ago

@quakig Indeed, I am not using the test set in the example.

In supervised ML, it is good practice to keep a separate, hold-out test dataset purely for metrics, once you have selected your best model and hyperparameters. This means that I first set aside a test set, then use lightgbm.cv to perform K-fold cross-validation on the train (+ validation) set to try different hyperparameter combinations, and once those are chosen, see how well the model performs on the test set.

https://cdn-images-1.medium.com/max/1000/1*4G__SV580CxFj78o9yUXuQ.png shows an example (the first diagram).

louisabraham commented 6 years ago

I have a similar problem that is not solved. I want to do a grid search on a whole pipeline, so I cannot use lightgbm.cv.

IMO, it would be much better to have a parameter like GradientBoostingClassifier's validation_fraction.

How is it done in lightgbm.cv? I suppose that it trains on k-2 folds and uses one fold for validation and one for test. I don't think it would be particularly horrible to use a random fold for the validation.
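As a workaround, one could mimic validation_fraction by carving a validation split off the training data by hand and passing it as eval_set (a sketch; X_train, y_train and the 10% fraction are placeholders, and newer LightGBM versions pass early stopping via callbacks):

import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Hold out 10% of the training data as the early-stopping validation set,
# mimicking a validation_fraction parameter:
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.1)

model = lgb.LGBMClassifier(n_estimators=1000)
model.fit(X_fit, y_fit,
          eval_set=[(X_val, y_val)],
          early_stopping_rounds=50)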

seahrh commented 5 years ago

I was able to use GridSearchCV with early stopping rounds. Working example

model = LGBMClassifier(...)
...
cv = GridSearchCV(pipe, cv=ps, param_grid=param_grid, scoring='roc_auc')
cv.fit(x_train, y_train,
  model__eval_set=[(x_val, y_val)], model__eval_metric='auc', model__early_stopping_rounds=200, 
  model__verbose=500)
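Note: the model__ prefixes above are routed by a sklearn Pipeline, so the estimator passed to GridSearchCV must be a Pipeline whose LightGBM step is named 'model'. A minimal sketch of what pipe and ps might look like (the scaler step and the choice of splitter are assumptions, not taken from the comment above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier

# The step name 'model' is what makes model__eval_set, model__eval_metric, etc.
# reach LGBMClassifier.fit:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LGBMClassifier()),
])

# 'ps' can be any splitter accepted by GridSearchCV's cv argument,
# e.g. an integer, a KFold instance, or a PredefinedSplit.

If the estimator passed to GridSearchCV is a bare LGBMClassifier rather than such a Pipeline, the model__-prefixed keyword arguments are not recognized by fit.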
lbo34 commented 5 years ago

@seahrh: Really? It worked this way? I am getting "fit() got an unexpected keyword argument 'model__eval_set'"

suissemaxx commented 5 years ago

This example works for me:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

gbm = lgb.LGBMRegressor(n_jobs=-1)

param_grid = {
    "num_leaves" : np.linspace(10, 200, 4, dtype=np.int32),
    'learning_rate': np.linspace(0.1, 1, 5),
    'n_estimators': np.linspace(10, 1000, 5, dtype=np.int32),
    'early_stopping_rounds' : [20],
}

gbm = GridSearchCV(gbm, param_grid, cv=3, scoring="neg_mean_squared_error", verbose=100, n_jobs=-1)
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="rmse")
print('Best parameters:', gbm.best_params_)