rodrigo-arenas / Sklearn-genetic-opt

ML hyperparameter tuning and feature selection, using evolutionary algorithms.
https://sklearn-genetic-opt.readthedocs.io
MIT License

RuntimeError: Cannot clone object GAFeatureSelectionCV(...), as the constructor either does not set or modifies parameter estimator #127

Closed RNarayan73 closed 1 year ago

RNarayan73 commented 1 year ago

System information
OS Platform and Distribution: Windows 11 Home
Sklearn-genetic-opt version: 0.10.0
deap version: 1.3.3
Scikit-learn version: 1.2.1
Python version: 3.10.1

Describe the bug
When including GAFeatureSelectionCV as a transformer within a pipeline to carry out feature selection, and then running GridSearchCV or GASearchCV on the pipeline to optimise hyperparameters, it throws an error:

RuntimeError: Cannot clone object GAFeatureSelectionCV(estimator=LGBMClassifier(), generations=5, n_jobs=14, population_size=5), as the constructor either does not set or modifies parameter estimator

To Reproduce
Steps to reproduce the behavior:

from sklearn.datasets import load_iris

from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.model_selection import GridSearchCV

iris = load_iris()

test_pipe = Pipeline([
    # 1. Feature selection using GAFeatureSelectionCV
    ('dim', GAFeatureSelectionCV(LGBMClassifier(),
                                 generations=5, population_size=5,
                                 n_jobs=-1)),
    ('clf', SGDClassifier())
])

grid_search_pipe = GridSearchCV(test_pipe,
                                param_grid={'clf__alpha': [10e-04, 10e-03, 10e-02, 10e-01, 10e+00]},
                                verbose=1)

grid_search_pipe.fit(iris.data, iris.target)

Expected behavior
The pipeline should be fitted without any errors.

Additional context
This situation arises when trying to wrap a whole pipeline with a hyperparameter tuning class such as GridSearchCV or GASearchCV. The purpose of including the pipeline within *SearchCV is to optimise hyperparameters of additional transform steps before the 'dim' step along with the hyperparameters of the classifier, although such steps are not shown above for brevity.
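For illustration, a hedged sketch of the kind of extended pipeline meant here, assuming the imports from the snippet above; the 'scaler' step and its grid entries are hypothetical additions, not part of the actual setup:

from sklearn.preprocessing import StandardScaler

# Hypothetical extended pipeline: an extra transform step before 'dim' whose
# hyperparameters are tuned together with the classifier's by the outer search.
full_pipe = Pipeline([
    ('scaler', StandardScaler()),                  # hypothetical extra transform step
    ('dim', GAFeatureSelectionCV(LGBMClassifier(),
                                 generations=5, population_size=5,
                                 n_jobs=-1)),
    ('clf', SGDClassifier())
])

grid_search_full = GridSearchCV(
    full_pipe,
    param_grid={
        'scaler__with_mean': [True, False],             # transform-step hyperparameter
        'clf__alpha': [10e-04, 10e-03, 10e-02, 10e-01], # classifier hyperparameter
    },
    verbose=1
)
# grid_search_full.fit(iris.data, iris.target)  # expected to fail with the same RuntimeError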

rodrigo-arenas commented 1 year ago

Hi @RNarayan73, I understand what you're trying to accomplish; there are a few things to notice:

from sklearn.datasets import load_iris
from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.model_selection import GridSearchCV

iris = load_iris()

feature_selection = GAFeatureSelectionCV(LGBMClassifier(),
                                         generations=5, population_size=5,
                                         n_jobs=-1,
                                         )
grid_search = GridSearchCV(SGDClassifier(),
                           param_grid={'alpha': [10e-04, 10e-03, 10e-02, 10e-01, 10e+00]},
                           )

ga_search_pipe = Pipeline([("dim", feature_selection), ("clf", grid_search)])

ga_search_pipe.fit(iris.data, iris.target)

Having said that, I'll investigate what is causing the error; at first sight it seems to be related to the way set_params works when cloning the underlying estimator.
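For reference, a simplified sketch of the check inside sklearn.base.clone that produces this error (not the library's actual code, which also deep-copies each parameter first; it assumes the imports from the snippets above):

# clone() reads the constructor parameters via get_params(deep=False),
# re-constructs the estimator from them, and then verifies that each parameter
# reported by the new object is the very same object that was passed in.
fs = GAFeatureSelectionCV(LGBMClassifier(), generations=5, population_size=5)

params = fs.get_params(deep=False)   # constructor parameters
rebuilt = type(fs)(**params)         # re-construct from those parameters
for name, value in params.items():
    if rebuilt.get_params(deep=False)[name] is not value:
        # On the affected version this identity check fails for 'estimator',
        # which is why clone() raises the RuntimeError shown above.
        print(f"parameter {name!r} is modified by the constructor")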

I hope this helps

RNarayan73 commented 1 year ago

Hello @rodrigo-arenas

Thank you for your reply and for investigating the issue further.

> In general, I'd not suggest mixing feature selection and hyperparameter tuning in the same iteration, this not only creates a large model (a whole feature selection algorithm per each hyperparameter candidate) but also has some other consequences on the optimization

With regard to your comment quoted above, there are different approaches. Yes, it is a more challenging problem over a wider search space, but given the stochastic nature of ML, doing it together in fact improves the robustness of the model. Furthermore, having them together in the pipeline is the only way to also tune the hyperparameters of the feature selection step itself. There is a good amount of literature supporting this approach, and I share a link below which advocates it: https://machinelearningmastery.com/machine-learning-modeling-pipelines/
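To illustrate the intent (parameter values here are purely illustrative, using the test_pipe from my reproduction above), tuning the feature-selection step's own hyperparameters through the outer search would look something like this:

# Illustrative only: tune the feature-selection step's hyperparameters together
# with the classifier's, via the usual '<step>__<param>' pipeline syntax.
param_grid = {
    'dim__generations': [5, 10],            # GAFeatureSelectionCV hyperparameters
    'dim__population_size': [5, 10],
    'clf__alpha': [10e-04, 10e-03, 10e-02], # classifier hyperparameter
}
grid_search_pipe = GridSearchCV(test_pipe, param_grid=param_grid, verbose=1)
# grid_search_pipe.fit(iris.data, iris.target)  # expected to work once cloning is fixed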

I hope you will be able to fix it soon.

Regards

Narayan

rodrigo-arenas commented 1 year ago

@RNarayan73 this has been solved in PR #128; you can clone the repo to test it out.

ananzibian commented 1 year ago

System information
OS: Windows 10
Sklearn-genetic-opt version: 0.10.1

Describe the bug
When importing the module with from sklearn_genetic import GAFeatureSelectionCV, ExponentialAdapter it throws an error:

ImportError: cannot import name '_estimator_has' from 'sklearn.feature_selection._from_model' (F:\anaconda\lib\site-packages\sklearn\feature_selection\_from_model.py)

rodrigo-arenas commented 1 year ago

Hi @ananzibian, the error you're showing is not related to this bug, but I think what is happening is that you have an old version of scikit-learn which doesn't have the _estimator_has function.

Please install a more recent version, for example pip install scikit-learn==1.2.1
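To double-check which scikit-learn version the environment is actually picking up (a quick sanity check, not part of the original instructions):

import sklearn
print(sklearn.__version__)  # should report a recent release, e.g. the 1.2.1 suggested above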

rodrigo-arenas commented 1 year ago

@RNarayan73 this has been fixed and released in version 0.10.1

RNarayan73 commented 1 year ago

@rodrigo-arenas, thank you for the fix.