SequentialFeatureSelector hyperparameter propagation doesn't work properly with GridSearch

rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.

https://rasbt.github.io/mlxtend/

Other

4.91k stars 871 forks source link

SequentialFeatureSelector hyperparameter propagation doesn't work properly with GridSearch #813

Closed ptoews closed 2 years ago

ptoews commented 3 years ago

Describe the bug

When using sklearn's GridSearchCV with SequentialFeatureSelector, the configured hyperparameter values are not properly propagated to the actual classifier that is used for fitting and predicting. I put together a MWE below that is based on example 8 in the docs, the only major change is the custom classifier.

In the output listed in the docs you can see that the score doesn't change with the k parameter of the KNN, which is very strange. While searching for similar issues I found that this has already been mentioned in multiple other issues, e.g. #456 and #511. Below you can see the unexpected behavior in the suggested approach.

Steps/Code to Reproduce

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
import sklearn.base
import numpy as np

class DebugClassifier(sklearn.base.BaseEstimator):

    def __init__(self, max_depth=10):
        self.max_depth = max_depth

    def fit(self, X, y, groups=None):
        print("Fitting with max_depth =", self.max_depth)

    def predict(self, X, **kwargs):
        print("Predicting with max_depth =", self.max_depth)
        return np.zeros(len(X))

    def set_params(self, **kwargs):
        print("Setting params:", kwargs)
        super().set_params(**kwargs)
        print("max_depth after setparams:", self.max_depth)

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, random_state=123)

clf = DebugClassifier(max_depth=10)

sfs1 = SFS(estimator=clf, 
           k_features=3,
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=5)

pipe = Pipeline([('sfs', sfs1), 
                 ('clf', clf)])

param_grid = [
  {#'sfs__k_features': [1, 4],
   'sfs__estimator__max_depth': [1, 5]}
  ]

gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=5,
                  #iid=True,
                  refit=False)

# run gridearch
gs = gs.fit(X_train, y_train)

Expected Results

Setting params: {'max_depth': 1}
max_depth after setparams: 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
...

Actual Results

Setting params: {'max_depth': 1}
max_depth after setparams: 1
Fitting with max_depth = 10
Predicting with max_depth = 10
Fitting with max_depth = 10
Predicting with max_depth = 10
...

As you can see, the value 1 for the hyperparameter max_depth is correctly configured for some classifier, however while fitting and predicting it appears that a different classifier is used, where the default value of max_depth=10 is still set.

Versions

MLxtend 0.18.0
Linux-5.8.0-48-generic-x86_64-with-glibc2.29
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0]
Scikit-learn 0.24.1
NumPy 1.20.1
SciPy 1.6.1

ptoews commented 3 years ago

I've just realized that a part of this is happening intentionally as explained here: https://github.com/rasbt/mlxtend/issues/456#issuecomment-434427387 I.e. the classifiers used for feature selection and the surrounding grid search are supposed to be separate. I understand that there are good use cases for this, however since I'm at least the third person that expected the opposite I suggest to either default to sharing the classifier or at the very least document it clearly. It is very difficult to even find out what happened.

But there must still be something wrong because in above code the DebugClassifier is never fitted with the configured hyperparameter choices. Or am I missing even more?

Also, now I'm wondering how I can couple the hyperparameters of feature selection and grid search.

rasbt commented 3 years ago

Thanks for taking the time to put together this informative example and illustration of the issue. I agree with you that this is a bug or confusing behavior at the least.

It's probably due to cloning the estimator like you suggested. To get the desired behavior, I believe that setting

sfs1 = SFS(...
   clone_estimator=False)

should do the trick. This should be mentioned in the documentation. Also, it would be great to have a warning message if someone is using the defaults in the grid search context -- not sure how to catch this use case scenario well though.

ptoews commented 3 years ago

Thanks for your response! I just tried out your suggestion of disabling cloning the estimator, but the scores are still the same. I had another look at the DebugClassifier prints and here is what it says for the example above (with SFS being given cv=False for less messages):

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Setting params: {'max_depth': 1}
max_depth after setparams: 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 10
Predicting with max_depth = 10
[CV] END ........................sfs__estimator__max_depth=1; total time=   0.0s

So the configured hyperparameter is used for fitting and predicting, but at the end there is still a classifier with the default settings used. Is it possible that it now does feature selection with the hyperparameters set by grid search but in the end it still uses default parameters? I don't understand what the purpose of the last fit/predict step is.

In comparison, here is the output with the pipeline containing only the DebugClassifier:

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Setting params: {'max_depth': 1}
max_depth after setparams: 1
Fitting with max_depth = 1
Predicting with max_depth = 1
[CV] END ...................................clf__max_depth=1; total time=   0.0s

So the default parameters appear only when SFS is used.

Regarding the documentation I would suggest to definitely mention this in the example section, since "GridSearch" and "SequentialFeatureSelection" were the keywords that made me think this was exactly what I was looking for, but it is actually a different way of combining these methods. Maybe even make it two examples, one with a shared classifier, and one with clearly different classifiers for feature selection and grid search, so the user is aware of multiple options existing in this context.

rasbt commented 3 years ago

Thanks for looking into this further!

So the configured hyperparameter is used for fitting and predicting, but at the end there is still a classifier with the default settings used. Is it possible that it now does feature selection with the hyperparameters set by grid search but in the end it still uses default parameters? I don't understand what the purpose of the last fit/predict step is.

I was confused about that at first too -- this shouldn't happen. But then looking over your code more closely, this is because of the following line:

clf = DebugClassifier(max_depth=10)

sfs1 = SFS(estimator=clf, 
           k_features=3,
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=5)

pipe = Pipeline([('sfs', sfs1), 
                 ('clf', clf)])  ### <---------------------------

param_grid = [
  {#'sfs__k_features': [1, 4],
   'sfs__estimator__max_depth': [1, 5]}
  ]

gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=5,
                  #iid=True,
                  refit=False)

I think what's happening here is that it is tuning the sequential feature selector internally, and then it is somehow using the original classifier. I think that's because GridSearchCV may have made a clone of it. I am not exactly sure.

To avoid this issue, you can make it two different entities. One is used internally by the sequential feature selector, and one is used externally.

E.g.,

clf1 = DebugClassifier(max_depth=10)
clf2 = DebugClassifier(max_depth=7)

sfs1 = SequentialFeatureSelector(estimator=clf1, 
           k_features=3,
           forward=True, 
           floating=False, 
           scoring='accuracy',
           clone_estimator=False,
           cv=5)

pipe = Pipeline([('sfs', sfs1), 
                 ('clf2', clf2)])

param_grid = [
  {
   'sfs__estimator__max_depth': [1, 5],
   'clf2__max_depth': [1, 5],
  }]

gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=2,
                  #iid=True,
                  refit=False)

Broadly, it is best to think of the classifier in the SequentialFeatureSelector as being decoupled from the one in the Pipeline. Unfortunately, I don't think there is a way around that except maybe for writing a custom Pipeline/GridSearchCV class.

But yeah, this should probably be made more clear in the documentation.

ptoews commented 3 years ago

Thanks for the tips, I believe the following solution based on your suggestions should do what I want: I encapsulated the classifier in the SFS and disabled cloning, which allowed GridSearch to configure the hyperparameters of the classifier that is also used for feature selection. But GridSearch wants to score/predict with the given estimator, which isn't supported by SFS, so I added a simple wrapper that delegates the predicting while transforming the data with the optimized feature selection.

# Same imports as above

class SFSWrapperEstimator(SFS):
    def predict(self, X, *args, **kwargs):
        return self.estimator.predict(self.transform(X), *args, **kwargs)

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, random_state=123)

clf = KNeighborsClassifier()

sfs1 = SFSWrapperEstimator(estimator=clf, 
           k_features=3,
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=False,
           clone_estimator=False)

param_grid = [{
   'estimator__n_neighbors': [1, 5]
  }]

gs = GridSearchCV(estimator=sfs1, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=5,
                  #iid=True,
                  refit=True,
                  )

# run gridearch
gs = gs.fit(X_train, y_train)

for i in range(len(gs.cv_results_['params'])):
    print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])
gs.best_estimator_.estimator.n_neighbors

Output:

{'estimator__n_neighbors': 1} test acc.: 0.37499999999999994
{'estimator__n_neighbors': 5} test acc.: 0.6583333333333334
5

I checked the output of the DebugClassifier as well and everything looks good, but not sure if I missed something. If you think this is useful, maybe the prediction support could be added to SFS or the code could be added to the examples. But I feel that the wrapping could be done in a more elegant way.

Edit: Just realized that SFS doesn't refit the estimator after feature selection but leaves it with the last tried combination.

rasbt commented 3 years ago

This is nice. Yeah, I think it should be added to the library (or the documentation at least).

Edit: Just realized that SFS doesn't refit the estimator after feature selection but leaves it with the last tried combination.

It could be a general GridSearchCV behavior. I.e., the best estimator from GridSearchCV behaves correctly:

Whether I run

param_grid = [{
   'estimator__n_neighbors': [50, 3]
  }]

gs = GridSearchCV(estimator=sfs1, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=5,
                  #iid=True,
                  refit=True,
                  )

# run gridearch
gs = gs.fit(X_train, y_train)

for i in range(len(gs.cv_results_['params'])):
    print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])
gs.best_estimator_.estimator.n_neighbors

or run the following (order of params flipped)

param_grid = [{
   'estimator__n_neighbors': [3, 50]
  }]

gs = GridSearchCV(estimator=sfs1, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=5,
                  #iid=True,
                  refit=True,
                  )

# run gridearch
gs = gs.fit(X_train, y_train)

for i in range(len(gs.cv_results_['params'])):
    print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])
gs.best_estimator_.estimator.n_neighbors

it both results in the same best estimator params.

When I then check

sfs1.estimator.get_params()['n_neighbors']

it prints 5 (the default value passed in before grid search, that's probably GridSearchCV is working with a clone)

ptoews commented 3 years ago

Interesting. But that doesn't seem to have been the issue here, because I accessed the (cloned) optimized SFS via gs.best_estimator_. I still think that SFS doesn't necessarily have a reference to the best estimator it fitted, because I noticed that after GridSearch fitted SFS it couldn't evaluate its score, since the last iteration was fitted with 3 features but the evaluation called SFS with just 1 feature as a result of the optimization. Looking at the code of SFS.fit() I couldn't see any updates to the estimator attribute after optimization either.

So I updated my solution by also overriding fit() and simply fitting the internal estimator with the optimized feature selection afterwards, like this:

class SFSWrapperEstimator(SFS):

    def fit(self, X, y, **kwargs):
        super().fit(X, y, **kwargs)
        self.estimator.fit(self.transform(X), y, **kwargs)

    def predict(self, X, *args, **kwargs):
        return self.estimator.predict(self.transform(X), *args, **kwargs)

It's quite ugly and inefficient, a better solution would be to store all the fitted estimators of SFS or at least their parameters to be able to provide the best estimator afterwards in a dedicated attribute and without additional fitting. However, so far this seems to work, and I will continue to check if everything is correct now.

rasbt commented 3 years ago

Yesterday when you first mentioned that I checked the SFS code as well and couldn't find anything about the refitting after optimization either. It is a bit hard to find out what's going on exactly because there are so many layers (SFS, pipeline, GridSearchCV). The proposed workaround with the SFSWrapperEstimator seems like a good one for making sure it works as intended. We can add that to the documentation for now until we may find a better solution.

ptoews commented 3 years ago

Sounds good. Apart from that, using the initial approach still seems to cause a bug (cited below), right?

But there must still be something wrong because in above code the DebugClassifier is never fitted with the configured hyperparameter choices.

It doesn't affect my suggested solution but might still be an issue for others.

rasbt commented 2 years ago

I just had some time to revisit this. I think the issue(s) can currently be avoided by not using refit=True and using

pipe.set_params(**gs.best_params_).fit(X_train, y_train)

instead. I am updated the docs via #875 accordingly. Also, at this point it may make sense not to change the SFS behavior but to focus more on the replacement via #834