rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

GridSearchCV & SequentialFeatureSelector to find best params & features #511

Closed. armgilles closed this issue 5 years ago.

armgilles commented 5 years ago

Hey @rasbt

I'm struggling to find the best features & tuning using SequentialFeatureSelector and GridSearchCV.

For each entry of my param_grid, I would like to test for the best combination of features.

Code to reproduce:


from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np

RANDOM_SEED = 42

boston = load_boston()
data, y = load_boston(return_X_y=True)

X = pd.DataFrame(data, columns=boston.feature_names)
rf = RandomForestRegressor(random_state=RANDOM_SEED)

# I want to transform my target in log
clf = TransformedTargetRegressor(regressor=rf,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

sfs = SFS(clf, 
          k_features=(1, X.shape[1]),
          forward=False, 
          floating=True, 
          scoring='neg_mean_absolute_error',
          verbose=1,
          n_jobs=-1,
          cv=3)

pipe_clf = Pipeline(steps=[('sfs', sfs),
                           ('clf', clf)])

param_grid = [{
    'sfs__estimator__regressor__max_depth' : [1, 3]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=True, refit=True)

# GDCV.fit(X, y)  # fails, but that's not my main problem here
GDCV.fit(X.values, y) # OK

cv_result = pd.DataFrame(GDCV.cv_results_)
cv_result.sort_values('rank_test_score')

[screenshot: cv_results_ sorted by rank_test_score]

Changing max_depth doesn't change mean_train_score or mean_test_score, and the report doesn't show the best k_features from sfs.

What am I missing?

rasbt commented 5 years ago

Not sure, but maybe only 1 feature gets selected, so that the max depth doesn't have an effect? Can you run the grid search two times separately, one time with

param_grid = [{
    'sfs__estimator__regressor__max_depth' : [1]
}]

and one time with

param_grid = [{
    'sfs__estimator__regressor__max_depth' : [3]
}]

And then maybe check the sfs.k_feature_idx_ of the resulting estimators to get a better idea of what's going on?
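
For example, something along these lines (a quick sketch, assuming GDCV was fit with refit=True as in your snippet):

# After refit=True, the best pipeline is available, and its 'sfs' step
# carries the selected feature indices and the score of that subset.
best_sfs = GDCV.best_estimator_.named_steps['sfs']
print(best_sfs.k_feature_idx_)  # tuple of selected column indices
print(best_sfs.k_score_)        # average CV score of the selected subset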

armgilles commented 5 years ago

Running with sfs__estimator__regressor__max_depth fails:

param_grid = [{
    'sfs__estimator__regressor__max_depth' : [1]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=True, refit=True)
GDCV.fit(X.values, y)

ValueError: Invalid parameter regressor for estimator RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
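
As that error message suggests, the keys GridSearchCV will accept can be listed straight from the pipeline (a quick check, using the pipe_clf defined above):

# Print every grid-searchable key exposed under the 'sfs' step; a param_grid
# key has to match one of these strings exactly.
print(sorted(k for k in pipe_clf.get_params().keys() if k.startswith('sfs__')))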

I have to use sfs__estimator__max_depth from SequentialFeatureSelector instead:

param_grid = [{
    'sfs__estimator__max_depth' : [1]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=True, refit=True)
GDCV.fit(X.values, y)

cv_result = pd.DataFrame(GDCV.cv_results_)
cv_result

[screenshot: cv_results_ for max_depth=1]

param_grid = [{
    'sfs__estimator__max_depth' : [3]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=True, refit=True)
GDCV.fit(X.values, y)

cv_result = pd.DataFrame(GDCV.cv_results_)
cv_result

[screenshot: cv_results_ for max_depth=3]

Both have the same result.


GDCV.best_estimator_.named_steps['sfs'].k_feature_idx_
# (0, 3, 4, 5, 6, 9, 11, 12)

It seems SFS works, but its feature selection doesn't appear in the GridSearchCV report.


GDCV.best_estimator_.named_steps['sfs'].transform(X)[0:2]
array([[6.320e-03, 0.000e+00, 5.380e-01, 6.575e+00, 6.520e+01, 2.960e+02,
        3.969e+02, 4.980e+00],
       [2.731e-02, 0.000e+00, 4.690e-01, 6.421e+00, 7.890e+01, 2.420e+02,
        3.969e+02, 9.140e+00]])
pnb commented 5 years ago

The error ValueError: Invalid parameter regressor for estimator RandomForestRegressor is probably caused by taking out the TransformedTargetRegressor step. Did you do that?

I took that out, and was able to reproduce your k_feature_idx_ result. With it still in there, using sfs__estimator__regressor__max_depth of 1 or 3 produces the same answer: k_feature_idx_ = (0, 5, 10, 11, 12). That is why the final result doesn't change -- the same features are selected regardless of max_depth.

armgilles commented 5 years ago

Good catch @pnb

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import GridSearchCV, cross_val_score
import pandas as pd
import numpy as np

RANDOM_SEED = 42

boston = load_boston()

data, y = load_boston(return_X_y=True)

X = pd.DataFrame(data, columns=boston.feature_names)
rf = RandomForestRegressor(random_state=RANDOM_SEED,
                           n_estimators=10)

clf = TransformedTargetRegressor(regressor=rf,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

sfs = SFS(clf, 
          k_features=(1, X.shape[1]),
          forward=False, 
          floating=True, 
          scoring='neg_mean_absolute_error',
          verbose=1,
          n_jobs=-1,
          cv=3)

pipe_clf = Pipeline(steps=[('sfs', sfs),
                           ('clf', clf)])

param_grid = [{
    'sfs__estimator__regressor__max_depth' : [3],#[1, 3]
#    'clf__regressor__max_depth' : [3]#[1, 3]
}]
GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=False, refit=True)
GDCV.fit(X.values, y)
GDCV.best_score_ 
# -3.8559436923367265

GDCV.best_estimator_.named_steps['sfs'].k_feature_idx_
# (0, 5, 8, 9, 10, 11, 12)

# Checking our features selection SFS report 
features_select_report = pd.DataFrame.from_dict(GDCV.best_estimator_.named_steps['sfs'].get_metric_dict()).T
features_select_report.sort_values('avg_score', ascending=0)

[screenshot: SFS metric report (features_select_report) sorted by avg_score]

GridSearchCV gives us a score of -3.85594, while SequentialFeatureSelector's best score with feature selection (on the same algo and same validation strategy) gives us -3.69947.

Seems weird, no?

Checking the best feature selection with cross_val_score:


# from GDCV.best_estimator_.named_steps['sfs'].k_feature_idx_
best_features = ['CRIM', 'RM', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Same algo as previously
rf_best = RandomForestRegressor(random_state=RANDOM_SEED,
                                n_estimators=10,
                                max_depth=3)

clf_best = TransformedTargetRegressor(regressor=rf_best,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

cv_score_best = cross_val_score(clf_best, 
                                X[best_features].values, y, cv=3,
                                scoring='neg_mean_absolute_error')
print(cv_score_best)
# [-2.3679384  -4.97948918 -5.40380292] 
print(cv_score_best.mean())
# -4.250410167013302

So GridSearchCV (best score) != SequentialFeatureSelector (best score) != cross_val_score, all based on the same data / validation strategy / algo / params.

Did I miss something?

pnb commented 5 years ago

Hi @armgilles,

It looks to me like the validation strategies are not actually the same. What you have is:

  • GridSearchCV splits the data 3-fold and scores the whole pipeline on its own test folds,
  • the SFS avg_score comes from a second 3-fold split of each GridSearchCV training fold (a subset of a subset of the data), and
  • cross_val_score evaluates the already-selected features on a fresh 3-fold split of the whole dataset,

so the three numbers are computed on different splits and aren't directly comparable.

I could be mistaken about how this is actually working, but that's how it appears to me from reading your code.

armgilles commented 5 years ago

Hmm, I see...

So my data (my whole learning dataset) is first cut by GridSearchCV (cv=3):

  • X_train_grid (2/3 of X)
  • X_test_grid (1/3 of X)

Then, with SequentialFeatureSelector, the X_train_grid part from the GridSearchCV split is cut again (cv=3):

  • X_train_sfs (2/3 of X_train_grid)
  • X_test_sfs (1/3 of X_train_grid)

Result from GridSearchCV:

Learn from X_train_grid and test on X_test_grid

Result from SequentialFeatureSelector:

Learn from X_train_sfs (2/3 of X_train_grid) and test on X_test_sfs

Question:

How can I look for the best feature selection, given a specific GridSearchCV params list, based on a cross-validation that uses our whole dataset (not a subset of a subset of our data)?

Dirty solution:

Loop manually over the GridSearchCV params list and fit a SequentialFeatureSelector for each, logging the results (concat each sfs.get_metric_dict() with the current GridSearchCV params) to find the best combination of estimator.get_params().keys() and feature selection.

pnb commented 5 years ago

I'm not 100% sure I understand the question, but I have a couple of ideas in case they align with what you're trying to do.

  1. Is your goal to determine the best features using the entire training set? I would not recommend that, but if you wanted to, you could use the custom cross-validation option and implement a custom model-selection data splitter (or use a list of indices) that provides all data as both train and test indices (see the first sketch after this list). This could over-fit your selection of features to the training data quite a bit, which will not invalidate your result (since no testing-set information is leaked) but will probably reduce accuracy.

  2. Is your goal instead to find the best features using the same parameters both for fitting the classifier for feature selection AND for fitting the classifier to test accuracy, to avoid grid searching once for feature selection and once for model selection? If so, I don't know of a super easy way, but using the memory argument of the Pipeline (see the second sketch after this list) can save a ton of time by converting the complexity from k^2 to 2k, for k hyperparameter combinations.

  3. Is your goal to find the best features by training clf on the training data and evaluating how good the set of features is on the testing data? If so you could simply run sfs.fit_transform(X), which would run your 3-fold cross-validation, training on 2/3 of X and testing on the remaining 1/3 for three iterations. However, this would be sort of useless for most uses since the resulting list of features would include leaked testing set information.
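
For option 1, one way to get such an "everything as train and test" split is to pass an explicit list of (train, test) index arrays as the cv argument (a rough sketch under the caveats above, reusing clf, X and y from the earlier snippets):

import numpy as np

# A single "split" in which every row is used both for training and for
# scoring; a scikit-learn-style cv argument accepts an iterable of
# (train_idx, test_idx) pairs. As noted above, this will overfit the
# feature selection to the training data.
all_idx = np.arange(X.shape[0])
full_data_cv = [(all_idx, all_idx)]

sfs_full = SFS(clf,
               k_features=(1, X.shape[1]),
               forward=False,
               floating=True,
               scoring='neg_mean_absolute_error',
               cv=full_data_cv)
sfs_full.fit(X.values, y)

And for option 2, the caching I mean is the Pipeline's memory argument (sketch):

from tempfile import mkdtemp
from sklearn.pipeline import Pipeline

# Cache fitted transformers (here the SFS step) on disk, so repeated fits
# with identical parameters are loaded from cache instead of recomputed.
pipe_clf = Pipeline(steps=[('sfs', sfs), ('clf', clf)], memory=mkdtemp())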

armgilles commented 5 years ago

I think it's 1. Validation strategy is an exciting area, and I fear a bad / random / unlucky / lucky test set.

To summarize my problem:

Find the best feature combination, given a list of params for a given estimator.

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import TransformedTargetRegressor
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Create my data
RANDOM_SEED = 42
boston = load_boston()
data, y = load_boston(return_X_y=True)
X = pd.DataFrame(data, columns=boston.feature_names)

# List of params
max_depth_params = [3, 4]

# To store results from multiple SFS
features_select_report = pd.DataFrame()

# Loop on my params
for i_max_depth in max_depth_params:
    print("Features selection with max_depth : " + str(i_max_depth))
    rf = RandomForestRegressor(random_state=RANDOM_SEED,
                           n_estimators=10, 
                           max_depth=i_max_depth)

    clf = TransformedTargetRegressor(regressor=rf,
                                     func=np.log1p,
                                     inverse_func=np.expm1)

    sfs = SFS(clf, 
          k_features=(1, X.shape[1]),
          forward=False, 
          floating=True, 
          scoring='neg_mean_absolute_error',
          verbose=0,
          n_jobs=-1,
          cv=3)
    sfs.fit(X.values, y)

    # Store result
    features_select_report_temp = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
    features_select_report_temp['i_max_depth'] = i_max_depth
    features_select_report = pd.concat([features_select_report, features_select_report_temp])

features_select_report.sort_values('avg_score', ascending=0).head(10)

[screenshot: features_select_report, top 10 rows by avg_score]

Best score is -3.93827 with max_depth=4 and k_feature_idx_ = (0, 1, 2, 3, 4, 9, 12). max_depth=3 is in 10th position (-4.04314).

With only one param (max_depth) it's pretty straightforward, but it gets more complex with many of them (normally the job of GridSearchCV); see the sketch below for one way to extend the loop.
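
To extend the same manual approach to several hyperparameters, one option is to drive the loop with sklearn's ParameterGrid (a sketch reusing X, y, RANDOM_SEED and the imports above; the extra n_estimators values are only illustrative):

from sklearn.model_selection import ParameterGrid

param_grid = {'max_depth': [3, 4],
              'n_estimators': [10, 50]}  # illustrative values

reports = []
for params in ParameterGrid(param_grid):
    rf = RandomForestRegressor(random_state=RANDOM_SEED, **params)
    clf = TransformedTargetRegressor(regressor=rf,
                                     func=np.log1p,
                                     inverse_func=np.expm1)
    sfs = SFS(clf,
              k_features=(1, X.shape[1]),
              forward=False,
              floating=True,
              scoring='neg_mean_absolute_error',
              n_jobs=-1,
              cv=3)
    sfs.fit(X.values, y)

    # Keep the SFS report together with the hyperparameters that produced it
    report = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
    for name, value in params.items():
        report[name] = value
    reports.append(report)

features_select_report = pd.concat(reports)
features_select_report.sort_values('avg_score', ascending=False).head(10)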

Checking result from SFS

max_depth=4
# Best features selection for max_depth=4
# (0, 1, 2, 3, 4, 9, 12)
best_features = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'TAX', 'LSTAT']
rf = RandomForestRegressor(random_state=RANDOM_SEED,
                           n_estimators=10, 
                           max_depth=4)

clf = TransformedTargetRegressor(regressor=rf,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

cv_score = cross_val_score(clf, 
                           X[best_features].values, y, cv=3,
                           scoring='neg_mean_absolute_error')
print(cv_score)
# [-2.73943653 -5.219839   -3.85554654]
print(cv_score.mean())
# -3.9382740221213233

Same result as SFS for these params (good).

max_depth=3
# Best features selection for max_depth=3
# (0, 1, 8, 10, 12)
best_features = ['CRIM', 'ZN', 'RAD', 'PTRATIO', 'LSTAT']
rf = RandomForestRegressor(random_state=RANDOM_SEED,
                           n_estimators=10, 
                           max_depth=3)

clf = TransformedTargetRegressor(regressor=rf,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

cv_score = cross_val_score(clf, 
                           X[best_features].values, y, cv=3,
                           scoring='neg_mean_absolute_error')
print(cv_score)
# [-2.53444967 -5.2265473  -4.39539221]
print(cv_score.mean())
# -4.052129726807058

Different results with SFS:

  • SFS with max_depth=3 = -4.04314
  • cross_val_score with max_depth=3 = -4.05213

I don't understand why there is a difference here (pretty small, but still a difference on this toy dataset)...

pnb commented 5 years ago

My guess is there are some differences in random seeds, resulting in different cross-validation folds. If you put np.random.seed(RANDOM_SEED) directly before every call to cross_val_score and re-run everything, do the results come out the same?

I also tried copying and pasting your code (the first of the three snippets you just posted) into a new Python file, ran it, and got avg_score = -4.05213 in the 10th row of the DataFrame, which differs from your DataFrame and matches your final result. That also suggests the random seed may have changed for the KFold that cross_val_score does internally, for example if you ran the code more than once.

Also, just to reiterate: these cross_val_score outputs are not really valid, since you are overfitting to the testing data. But that is beside the point a bit.

armgilles commented 5 years ago

cross_val_score doesn't have a random_state (it seems deterministic, see link).

I added np.random.seed(RANDOM_SEED) and re-ran everything multiple times. I got the same results (as in my last post).

Now using KFold to fix my CV split in both SFS & cross_val_score:

from sklearn.model_selection import KFold
my_kf = KFold(n_splits=2, random_state=RANDOM_SEED)

I can now reproduce the SFS & cross_val_score results (code parts 2 & 3 from my last post) (WIN)!
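
For completeness, a sketch of how that splitter gets wired into both calls (reusing my_kf and the clf / X / y from the snippets above):

sfs = SFS(clf,
          k_features=(1, X.shape[1]),
          forward=False,
          floating=True,
          scoring='neg_mean_absolute_error',
          cv=my_kf)              # same folds as below
sfs.fit(X.values, y)

cv_score = cross_val_score(clf, sfs.transform(X.values), y,
                           cv=my_kf,  # identical folds
                           scoring='neg_mean_absolute_error')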

Could the difference come from SFS's CV implementation vs. cross_val_score's (weird)?

I tested the code on macOS & Linux and got small changes in my results (due to different GCC compiler versions, I think).

I don't get the point about overfitting with cross_val_score (maybe I'm tired). With cv=3 I learn and test on 3 different parts of my entire data (X), and taking the .mean() of the results is better (in theory) than splitting the data into X_train / X_test, where I could overfit on X_test (luck, no outliers, etc.), no?

Sorry to bother you & thanks for your past answers; it's more a matter of reproducing results (and being able to trust them 👍).

pnb commented 5 years ago

Good call that shuffle = False in KFold by default. As for differences in CV implementation, however, I don't think there are many. From what I can tell in the source code, SFS uses the sklearn K-fold implementation.

Differences between OSX/Linux could also be due to differing versions of packages, like numpy or sklearn or any of the other pieces that might slightly influence the random numbers that get generated. Or it might be compiler too as you suggested.

The validity issue with your part 1 code is that you are trying multiple values of max_depth without a separate validation set to ensure that you are not picking the value based on testing set performance. With 2 values the risk is minimal, but with 100 or 1000 hyperparameter options you will surely find some that work well purely by chance in your testing data, but don't generalize well to new data. This is sort of like p-hacking in statistics. The model parameters are not themselves overfitted, but the model selection process is (the hyperparameters).

Similarly, the validity issue with part 2 and 3 is that you have selected the best features based on how well they worked on the testing set. With lots of features you will surely find some features that work well on the testing set purely by chance, but don't generalize. To properly cross-validate, you can't use any information from the testing set -- not even how well something worked. As a consequence, if you want to try multiple sets of hyperparameters (usually a good idea), then you need nested cross-validation.
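
A minimal sketch of what such a nested setup could look like here (reusing pipe_clf, X, y, RANDOM_SEED and the grid key that worked earlier in the thread):

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Inner loop: pick max_depth (and, via SFS, the feature subset).
# Outer loop: score that whole selection procedure on held-out folds.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=RANDOM_SEED)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=RANDOM_SEED)

inner_search = GridSearchCV(estimator=pipe_clf,
                            param_grid={'sfs__estimator__regressor__max_depth': [1, 3]},
                            scoring='neg_mean_absolute_error',
                            cv=inner_cv, refit=True)

outer_scores = cross_val_score(inner_search, X.values, y,
                               scoring='neg_mean_absolute_error',
                               cv=outer_cv)
print(outer_scores.mean())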

rasbt commented 5 years ago

Sorry for the late response here ... it's been a busy semester and I haven't had a chance to thoroughly read through this thread. However, I did some code cleanup/rewrite regarding get_params and set_params inside the SequentialFeatureSelector (#529), which may address a potential issue there, if one existed.

zheng-gt commented 2 years ago

@armgilles Hi Armand, I am very happy to have found your post; I am facing the same problem when using GridSearchCV & SequentialFeatureSelector to find the best feature combination with a given list of params (n_estimators and max_depth) for a random forest estimator.

I noticed that you combined a for-loop and SequentialFeatureSelector to get the best feature combination, but I didn't get the point of your later discussion about the validity issue.

My question is: is the following process appropriate?

  1. split the data into train and test first,
  2. search for the best feature combination and the corresponding random forest parameters using the for-loop and SFS from your code (quoted below),
  3. then validate on the test set?

(armgilles' earlier comment and code quoted in full; see above)

armgilles commented 2 years ago

Hi @zheng-ya!

I think you could find your answer in the SFS and GridSearch documentation.

Old thread, but I'm quite impressed with the quality of the documentation, @rasbt! 🥇

rasbt commented 2 years ago

Hey @zheng-ya, hm, that's weird. On my computer they actually gave the same results.

[two screenshots showing matching results from both runs]
rasbt commented 2 years ago

The best explanation I have for this, @zheng-ya, is that this might be a random seed issue since you are using k-fold CV. Maybe try to fix the random seed and run it again. I.e., you can try the following:

from sklearn.model_selection import KFold

my_cv = KFold(n_splits=3, random_state=123, shuffle=True)

sfs = SFS(cv=my_cv, ...)
cross_val_score(cv=my_cv, ...)
zheng-gt commented 2 years ago

@armgilles Thank you. I will check it and compare it with the for-loop.

zheng-gt commented 2 years ago

@rasbt Great explanation. Thank you very much for your patience and time!