mandeldm closed this issue 6 years ago.
Maybe you can use GroupKFold + ParameterGrid to simulate GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html
BTW, could you give a simple runnable case we can debug on? We might try to make GridSearchCV work. Thanks.
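For illustration, a minimal sketch of that idea (hypothetical hyperparameter values, reusing the df / Xcols / y_col / 'grp' names from the reproduction code further down; note that the held-out fold doubles as the early-stopping set here, which is the leakage caveat discussed later in this thread):
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold, ParameterGrid

param_grid = {'num_leaves': [31, 127], 'learning_rate': [0.01, 0.1]}
results = []
for params in ParameterGrid(param_grid):
    fold_scores = []
    for train_idx, test_idx in GroupKFold(n_splits=4).split(df[Xcols], df[y_col], groups=df['grp']):
        model = lgb.LGBMRegressor(n_estimators=500, **params)
        model.fit(df[Xcols].iloc[train_idx], df[y_col].iloc[train_idx],
                  eval_set=[(df[Xcols].iloc[test_idx], df[y_col].iloc[test_idx])],
                  eval_metric='l1',
                  early_stopping_rounds=5,
                  verbose=False)
        preds = model.predict(df[Xcols].iloc[test_idx])
        fold_scores.append(mean_absolute_error(df[y_col].iloc[test_idx], preds))
    # Average MAE across folds for this hyperparameter combination:
    results.append((params, sum(fold_scores) / len(fold_scores)))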
Here is a simple version of the code that runs fine. Uncommenting early_stopping_rounds=5
breaks it for me.
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, GroupKFold

np.random.seed(1)

# Build 2 categorical features, 3 float ones & an error term:
categorical_df = pd.DataFrame(np.random.randint(2, size=(1000, 2)), columns=['cat01', 'cat02']).astype(int)
float_df = pd.DataFrame(np.random.rand(1000, 4), columns=['aa', 'bb', 'cc', 'error'])

# Build a group feature for GroupKFold (k=4):
kfold_df = pd.DataFrame(np.random.randint(4, size=(1000, )), columns=['grp'])

# Assemble features into a df.
df = pd.concat([categorical_df, float_df, kfold_df], axis=1)

# Create a dependent variable.
df['yy'] = \
    df['aa'] + 2 * df['bb'] + \
    df['cc'] * df['cat01'] + \
    df['error'] * (1 + df['cat02'])

# Identify our X, y & categorical columns.
Xcols = ['cat01', 'cat02', 'aa', 'bb', 'cc']
y_col = 'yy'
categoricals = ['cat01', 'cat02']

# Set up and run LGBM with GridSearchCV based on GroupKFold.
gkf = GroupKFold(n_splits=4).split(X=df[Xcols],
                                   y=df[y_col],
                                   groups=df['grp'])
param_grid = {
    'num_leaves': [31, 127],
    'feature_fraction': [0.5, 1.0],
    'bagging_fraction': [0.75, 0.95],
    'reg_alpha': [0.1, 0.5]}
lgb_estimator = lgb.LGBMRegressor(boosting_type='gbdt',
                                  objective='regression',
                                  bagging_freq=5,
                                  num_boost_round=50,
                                  learning_rate=0.01,
                                  eval_metric='l1',
                                  categorical_feature=[Xcols.index(col) for col in categoricals])#,
                                  # early_stopping_rounds=5)  # REMOVING THIS ARGUMENT MAKES THE CODE RUN OKAY
gsearch = GridSearchCV(estimator=lgb_estimator,
                       param_grid=param_grid,
                       cv=gkf)
lgb_model = gsearch.fit(X=df[Xcols],
                        y=df[y_col])
print(lgb_model.best_params_, lgb_model.best_score_)
I checked the GridSearchCV code; its logic is train and test. We need a validation set during training for early stopping, and it should not be the test set.
Apart from that, early_stopping_rounds should be passed to the fit function, like
lgb_model = gsearch.fit(X=df[Xcols], y=df[y_col], eval_set=(df[Xcols], df[y_col]), early_stopping_rounds=5)
though it may not be what you want.
When I add eval_set and early_stopping_rounds to the fit function as you did, I get: TypeError: fit() got an unexpected keyword argument 'eval_set'
I'm using the following versions: Python 3.5.3 (WinPython) on Windows 10, Jupyter 4.3.0, np 1.13.3, pd 0.20.3, lgb 2.0.10, sklearn 0.18.1.
UPDATE: I did not realize that GridSearchCV.fit began supporting kwargs in 0.19. Thank you, I will update my package.
But that does not really solve the original problem, since evaluating on the training set is not a great idea. Is my best option to write nested loops over both the validation folds and the parameter grid?
Thanks.
You can try our native API lightgbm.cv.
I think it is simpler than your last comment suggests, @mandeldm.
As @wxchan said, lightgbm.cv performs K-fold cross validation for an LGBM model, and allows early stopping.
At the end of the day, sklearn's GridSearchCV just does that (performing K-fold) plus turning your hyperparameter grid into an iterable with all possible hyperparameter combinations. This means that you could just use lightgbm.cv for hyperparameter optimization, with early stopping embedded in each experiment (each hyperparameter combination). A very naïve (but correct) way to do so would be something like:
import lightgbm
from sklearn.model_selection import train_test_split

# Whatever dataset you have (a DataFrame `df` with a "target" column
# is assumed here as a placeholder):
train, test = train_test_split(df, test_size=0.2)

# lightgbm.cv expects a lightgbm.Dataset rather than a raw DataFrame:
train_set = lightgbm.Dataset(train.drop("target", axis=1), label=train["target"])

# Imagine now that you want to optimize num_leaves and
# learning_rate, and also use early stopping:
num_leaves_choices = [56, 128, 256]
learning_rate_choices = [0.05, 0.1, 0.2]

# We will store the cross validation results in a simple list,
# with tuples in the form of (hyperparam dict, cv score):
cv_results = []
for num_lv in num_leaves_choices:
    for lr in learning_rate_choices:
        hyperparams = {"objective": "binary",  # or whatever objective you need
                       "num_leaves": num_lv,
                       "learning_rate": lr,
                       # Other constant hyperparameters
                       }
        validation_summary = lightgbm.cv(hyperparams,
                                         train_set,
                                         num_boost_round=4096,  # any high number will do
                                         nfold=10,
                                         metrics=["auc"],
                                         early_stopping_rounds=50,  # Here it is
                                         verbose_eval=10)
        optimal_num_trees = len(validation_summary["auc-mean"])
        # Let's just add the optimal number of trees (chosen by early stopping)
        # to the hyperparameter dictionary:
        hyperparams["optimal_number_of_trees"] = optimal_num_trees
        # And we append the results to cv_results:
        cv_results.append((hyperparams, validation_summary["auc-mean"][-1]))
Obviously, the nested for loops are a very basic approach. Fortunately, scikit-learn's ParameterGrid can build that iterator for you :)
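For instance, a sketch of the same search driven by ParameterGrid instead of nested loops (reusing the hypothetical train_set and hyperparameter choices from the example above):
from sklearn.model_selection import ParameterGrid

grid = {"num_leaves": [56, 128, 256],
        "learning_rate": [0.05, 0.1, 0.2]}

cv_results = []
for hyperparams in ParameterGrid(grid):
    hyperparams["objective"] = "binary"  # plus any other constant hyperparameters
    validation_summary = lightgbm.cv(hyperparams,
                                     train_set,
                                     num_boost_round=4096,
                                     nfold=10,
                                     metrics=["auc"],
                                     early_stopping_rounds=50,
                                     verbose_eval=10)
    # Number of trees chosen by early stopping, plus the final CV score:
    hyperparams["optimal_number_of_trees"] = len(validation_summary["auc-mean"])
    cv_results.append((hyperparams, validation_summary["auc-mean"][-1]))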
@wxchan Is this solved?
I received the same error "ValueError: For early stopping, at least one dataset and eval metric is required for evaluation". My code is like this; obviously, I've set 'eval_set' and 'eval_metric':
lgbc_params = {
    'boosting_type': "gbdt",  # string, optional (default="gbdt")
    'num_leaves': 31,         # int, optional (default=31)
    'max_depth': -1,          # int, optional (default=-1)
    'objective': 'binary',    # string, callable or None, optional (default=None)
    'random_state': 100,      # int or None, optional (default=None)
    'n_jobs': -1,             # optional (default=-1)
    'silent': 0,              # bool, optional (default=True)
}
lgbc_fit_params = {
    'eval_set': [(x_eval, y_eval)],  # list or None, optional (default=None)
    'eval_names': ['evalset'],       # list of strings or None, optional (default=None)
    'eval_metric': ['f1'],           # string, list of strings, callable or None, optional (default=None)
    'early_stopping_rounds': 10      # int or None, optional (default=None)
}
n_lr = np.linspace(0.01, 0.2, 19)
param_grid = {
    'learning_rate': n_lr,
}
lgbc = LGBMClassifier(**lgbc_params)
gs_lgbc = model_selection.GridSearchCV(lgbc, return_train_score=True, param_grid=param_grid,
                                       cv=5, refit=True, n_jobs=1, verbose=0, scoring='f1')
gs_lgbc.fit(X=x_train, y=y_train, **lgbc_fit_params)
@Interstella12 I think f1 is not a built-in eval_metric in LightGBM. You can set it to a valid one.
@julioasotodv
Given your code example above, where you loop through each parameter, I see that you have divided the data into train and test sets. But then you don't seem to use the test data from the split anywhere for validation.
I think there is something missing in the code, or I am missing something very obvious. Can you explain what purpose the train-test split serves in that case?
@quakig Indeed, I am not using the test set in the example.
In supervised ML, it is good practice to have a separate, hold-out test dataset just for metrics purposes, once you have selected your best model and hyperparameters. This means that I first separate a test set, then use lightgbm.cv to perform K-fold cross validation and try different hyperparameter combinations on the train (+ validation) set, and once those are chosen, see how well the model performs on the test set.
https://cdn-images-1.medium.com/max/1000/1*4G__SV580CxFj78o9yUXuQ.png shows an example (the first diagram).
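To make that concrete, a rough sketch of that workflow under the same assumptions as the example above (a hypothetical DataFrame with a "target" column, plus the train_set and cv_results built there):
from sklearn.metrics import roc_auc_score

# 1) The test set was held out before any tuning (the train_test_split above).
# 2) Hyperparameters were tuned with lightgbm.cv on the train portion;
#    pick the combination with the best cross-validated AUC:
best_hyperparams, best_cv_auc = max(cv_results, key=lambda result: result[1])
n_trees = best_hyperparams.pop("optimal_number_of_trees")
# 3) Retrain on the full train set with the chosen settings and report
#    the final metric on the untouched test set:
final_model = lightgbm.train(best_hyperparams, train_set, num_boost_round=n_trees)
test_preds = final_model.predict(test.drop("target", axis=1))
print("Test AUC:", roc_auc_score(test["target"], test_preds))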
I have a similar problem that is not solved. I want to do a grid search on a whole pipeline, so I cannot use lightgbm.cv.
IMO, it would be much better to have a validation_fraction parameter like GradientBoostingClassifier has.
How is it done in lightgbm.cv? I suppose it trains on k-2 folds, and uses one fold for validation and one for test. I don't think it would be particularly horrible to use a random fold for the validation.
I was able to use GridSearchCV with early stopping rounds. Working example:
model = LGBMClassifier(...)
...
cv = GridSearchCV(pipe, cv=ps, param_grid=param_grid, scoring='roc_auc')
cv.fit(x_train, y_train,
       model__eval_set=[(x_val, y_val)], model__eval_metric='auc', model__early_stopping_rounds=200,
       model__verbose=500)
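For context, here is a hypothetical sketch of how pipe and ps might be set up for a call like that (assumptions on my part, not @seahrh's actual code): the final Pipeline step must be named 'model' so that the model__* fit parameters are routed to it, and ps is assumed to be a PredefinedSplit. Note that an eval_set passed through fit parameters reaches the LightGBM step untransformed by earlier pipeline steps.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline

# Final step named 'model'; other preprocessing steps could precede it
# (keeping in mind the eval_set caveat above):
pipe = Pipeline([("model", LGBMClassifier(n_estimators=5000))])

# Hypothetical single validation fold: the last 20% of the training rows
# (-1 = always in the training split, 0 = member of the single test fold):
test_fold = np.where(np.arange(len(x_train)) < 0.8 * len(x_train), -1, 0)
ps = PredefinedSplit(test_fold)

param_grid = {"model__num_leaves": [31, 127],
              "model__learning_rate": [0.05, 0.1]}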
@seahrh: Really? It worked this way? I am getting "fit() got an unexpected keyword argument 'model__eval_set'".
This example works for me:
gbm = lgb.LGBMRegressor(n_jobs=-1)
param_grid = {
    'num_leaves': np.linspace(10, 200, 4, dtype=np.int32),
    'learning_rate': np.linspace(0.1, 1, 5),
    'n_estimators': np.linspace(10, 1000, 5, dtype=np.int32),
    'early_stopping_rounds': [20],
}
gbm = GridSearchCV(gbm, param_grid, cv=3, scoring="neg_mean_squared_error", verbose=100, n_jobs=-1)
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="rmse")
print('Best parameters:', gbm.best_params_)
I'm using LGBMRegressor with sklearn.model_selection.GridSearchCV, with cross-validation split based on sklearn.model_selection.GroupKFold. When I include early_stopping_rounds=5 in the estimator, I get the following error:
ValueError: For early stopping, at least one dataset and eval metric is required for evaluation
Without the early_stopping_rounds argument the code runs fine.
I could be wrong, but it seems that LGBMRegressor does not view the cv argument in GridSearchCV and the groups argument in GridSearchCV.fit as a legitimate eval dataset. Is there a proper way to use early_stopping_rounds with GridSearchCV / GroupKFold?
Thank you.