rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Does stacking regressor work with sklearn GridSearchCV() with hyperparameter search? #760

Closed · jpandeinge closed this 3 years ago

jpandeinge commented 3 years ago

I have a question about whether the stacking regressor supports sklearn's GridSearchCV() for hyperparameter tuning and optimization. Sample code:

seed = 1
from mlxtend.regressor import StackingCVRegressor
from sklearn.svm import SVR
from lightgbm import LGBMRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold

# Linear regression.
lin_reg = LinearRegression(normalize=True, fit_intercept=False)

# CatBoost regressor.
cat_boost = CatBoostRegressor(random_seed=seed, depth=4)

# Epsilon-support vector regression.
svr = SVR(C=1, kernel='poly', degree=5)

# Lasso linear model with iterative fitting along a regularization path.
lasso = LassoCV(
    alphas=[0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1],
    max_iter=1000,
    tol=1e-2,
    random_state=seed,
    fit_intercept=True,
    cv=KFold(n_splits=5, shuffle=True, random_state=seed),
    verbose=True,
    normalize=True,
    n_jobs=-1,
)

# Decision tree regressor (also used below as the AdaBoost base estimator).
dt_regressor = DecisionTreeRegressor(max_depth=4, random_state=seed)

# Random forest regressor.
rf = RandomForestRegressor(random_state=seed, n_estimators=100, verbose=1)

# XGBoost regressor.
xgb_regressor = XGBRegressor(n_estimators=500, colsample_bytree=1,
                             objective='reg:squarederror', eval_metric='rmse',
                             importance_type='weight',
                             random_state=seed, verbose=1)

# LightGBM regressor.
lgbm_regressor = LGBMRegressor(objective='regression',
                               importance_type='weight',
                               boosting_type='rf', bagging_fraction=0.8, bagging_freq=1,
                               num_leaves=31, n_estimators=500, learning_rate=0.015,
                               random_state=seed, metric='rmse', verbose=1)

# AdaBoost regressor.
ada_boost = AdaBoostRegressor(dt_regressor, random_state=seed, n_estimators=100)

forecaster = StackingCVRegressor(regressors=(lin_reg, lasso, svr, lgbm_regressor,
                                             xgb_regressor, cat_boost, ada_boost),
                                 meta_regressor=lin_reg,
                                 shuffle=True,
                                 cv=10,
                                 use_features_in_secondary=True)
from sklearn.model_selection import GridSearchCV

params = {'estimator__linearregression__fit_intercept': [True, False],
          'estimator__linearregression__normalize': [True, False]}

grid = GridSearchCV(
    estimator=forecaster, 
    param_grid=params, 
    cv=5,
    refit=True
)

grid.fit(x_train, y_train)

print("Best: %f using %s" % (grid.best_score_, grid.best_params_))

I believe the error comes from the params variable I defined, but I don't see what's wrong, since I took all the parameter names for the regressors from grid.get_params().keys().

However, the above code leads to the error below:

ValueError: Invalid parameter estimator for estimator StackingCVRegressor(cv=10,
                    meta_regressor=LinearRegression(fit_intercept=False,
                                                    normalize=True),
                    regressors=(LinearRegression(fit_intercept=False,
                                                 normalize=True),
                                LassoCV(alphas=[0.0001, 0.0003, 0.0006, 0.001,
                                                0.003, 0.006, 0.01, 0.03, 0.06,
                                                0.1, 0.3, 0.6, 1],
                                        cv=KFold(n_splits=5, random_state=1, shuffle=True),
                                        n_jobs=-1, normalize=True,
                                        random_state=1, tol=0.01,
                                        verbose=...
                                             random_state=1, reg_alpha=None,
                                             reg_lambda=None,
                                             scale_pos_weight=None,
                                             subsample=None, tree_method=None,
                                             validate_parameters=None,
                                             verbose=1, verbosity=None),
                                <catboost.core.CatBoostRegressor object at 0x7f1bf9231a10>,
                                AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=4,
                                                                                       random_state=1),
                                                  n_estimators=100,
                                                  random_state=1)),
                    use_features_in_secondary=True). Check the list of available parameters with `estimator.get_params().keys()`.

Is there a way to tune the parameters of all the regressors used, so as to find the optimal ones and apply them to a new model? And is there a way to retain them, since I would like to do a hyperparameter search?

jpandeinge commented 3 years ago

I solved it: I had defined the parameters in the params variable wrongly. I changed params from

params = {'estimator__linearregression__fit_intercept': [True, False],
          'estimator__linearregression__normalize': [True, False]}

to the following, by removing the estimator__ prefix in front of every base model's name:

params = {'linearregression__fit_intercept': [True, False],
          'linearregression__normalize': [True, False]}
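
For completeness, here is a minimal self-contained sketch of the working pattern; the toy data and the simpler Lasso/Ridge stack are stand-ins for the full ensemble above, not code from the original post. The key point is that the parameter names GridSearchCV accepts for param_grid are exactly the keys of stack.get_params(); calling grid.get_params() instead reports every key with an extra estimator__ prefix, which is what made the original grid invalid. Note also that the grid values should be real booleans rather than the strings 'True'/'False', since both strings are truthy.

import numpy as np
from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV

# Toy regression data, for illustration only.
rng = np.random.RandomState(1)
X = rng.rand(100, 5)
y = rng.rand(100)

stack = StackingCVRegressor(
    regressors=(Lasso(random_state=1), Ridge(random_state=1)),
    meta_regressor=LinearRegression(),
    cv=5,
    shuffle=True,
    use_features_in_secondary=True,
)

# The names GridSearchCV understands for this estimator: base regressors are
# keyed by their lowercased class names ('lasso', 'ridge'), the meta model by
# the meta_regressor__ prefix.
print(sorted(stack.get_params().keys()))

params = {
    'lasso__alpha': [0.1, 1.0, 10.0],
    'ridge__alpha': [0.1, 1.0, 10.0],
    'meta_regressor__fit_intercept': [True, False],  # booleans, not strings
}

grid = GridSearchCV(estimator=stack, param_grid=params, cv=5, refit=True)
grid.fit(X, y)
print("Best: %f using %s" % (grid.best_score_, grid.best_params_))

As for retaining the best settings: with refit=True (the default), GridSearchCV already retrains the stack on the whole training set using grid.best_params_, and the refitted model is available as grid.best_estimator_ for predictions on new data.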