Not sure, but maybe only 1 feature gets selected so that the max depth doesn't have an effect? Can you run the grid search 2 times separately, one time with

```python
param_grid = [{
    'sfs__estimator__regressor__max_depth': [1]
}]
```

and one time with

```python
param_grid = [{
    'sfs__estimator__regressor__max_depth': [3]
}]
```

And then maybe check the `sfs.k_feature_idx_` of the resulting estimators to get a better idea of what's going on?
Running with `sfs__estimator__regressor__max_depth` failed:

```python
param_grid = [{
    'sfs__estimator__regressor__max_depth': [1]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=True, refit=True)
GDCV.fit(X.values, y)
```

```
ValueError: Invalid parameter regressor for estimator RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
```
I have to use `sfs__estimator__max_depth` from `SequentialFeatureSelector`:

```python
param_grid = [{
    'sfs__estimator__max_depth': [1]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=True, refit=True)
GDCV.fit(X.values, y)
cv_result = pd.DataFrame(GDCV.cv_results_)
cv_result
```

```python
param_grid = [{
    'sfs__estimator__max_depth': [3]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=True, refit=True)
GDCV.fit(X.values, y)
cv_result = pd.DataFrame(GDCV.cv_results_)
cv_result
```
Both have the same result.
```python
GDCV.best_estimator_.named_steps['sfs'].k_feature_idx_
# (0, 3, 4, 5, 6, 9, 11, 12)
```

It seems `sfs` works, but it doesn't appear in the `GridSearchCV` report.

```python
GDCV.best_estimator_.named_steps['sfs'].transform(X)[0:2]

# array([[6.320e-03, 0.000e+00, 5.380e-01, 6.575e+00, 6.520e+01, 2.960e+02,
#         3.969e+02, 4.980e+00],
#        [2.731e-02, 0.000e+00, 4.690e-01, 6.421e+00, 7.890e+01, 2.420e+02,
#         3.969e+02, 9.140e+00]])
```
The error `ValueError: Invalid parameter regressor for estimator RandomForestRegressor` is probably caused by taking out the `TransformedTargetRegressor` step. Did you do that?

I took that out, and was able to reproduce your `k_feature_idx_` result. With it still in there, using `sfs__estimator__regressor__max_depth` of 1 or 3 produces the same answer: `k_feature_idx_ = (0, 5, 10, 11, 12)`. That is why the final result doesn't change -- the same features are selected regardless of `max_depth`.
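As the error message itself suggests, the valid parameter paths can be listed from the pipeline. A minimal sketch, assuming the `pipe_clf` from the full example below; whether the deeply nested `sfs__estimator__...` keys show up depends on the mlxtend version (see the `get_params`/`set_params` cleanup mentioned later in this thread):

```python
# Print every tunable parameter of the pipeline that mentions max_depth.
# With the TransformedTargetRegressor kept inside SFS, the key used above,
# 'sfs__estimator__regressor__max_depth', should be among them.
for key in sorted(pipe_clf.get_params().keys()):
    if 'max_depth' in key:
        print(key)
```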
Good catch @pnb
```python
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import GridSearchCV, cross_val_score
import pandas as pd
import numpy as np

RANDOM_SEED = 42

boston = load_boston()
data, y = load_boston(return_X_y=True)
X = pd.DataFrame(data, columns=boston.feature_names)

rf = RandomForestRegressor(random_state=RANDOM_SEED,
                           n_estimators=10)

clf = TransformedTargetRegressor(regressor=rf,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

sfs = SFS(clf,
          k_features=(1, X.shape[1]),
          forward=False,
          floating=True,
          scoring='neg_mean_absolute_error',
          verbose=1,
          n_jobs=-1,
          cv=3)

pipe_clf = Pipeline(steps=[('sfs', sfs),
                           ('clf', clf)])

param_grid = [{
    'sfs__estimator__regressor__max_depth': [3],  # [1, 3]
    # 'clf__regressor__max_depth': [3]  # [1, 3]
}]

GDCV = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, cv=3,
                    n_jobs=-1, scoring='neg_mean_absolute_error',
                    return_train_score=True,
                    verbose=False, refit=True)
GDCV.fit(X.values, y)

GDCV.best_score_
# -3.8559436923367265

GDCV.best_estimator_.named_steps['sfs'].k_feature_idx_
# (0, 5, 8, 9, 10, 11, 12)

# Checking our feature selection SFS report
features_select_report = pd.DataFrame.from_dict(GDCV.best_estimator_.named_steps['sfs'].get_metric_dict()).T
features_select_report.sort_values('avg_score', ascending=0)
```
`GridSearchCV` gives us a score of `-3.85594`, while the `SequentialFeatureSelector` best score with feature selection (on the same algo and same validation strategy) gives us `-3.69947`. Seems weird, no?

Checking the best feature selection with `cross_val_score`:
```python
# from GDCV.best_estimator_.named_steps['sfs'].k_feature_idx_
best_features = ['CRIM', 'RM', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Same algo as previously
rf_best = RandomForestRegressor(random_state=RANDOM_SEED,
                                n_estimators=10,
                                max_depth=3)

clf_best = TransformedTargetRegressor(regressor=rf_best,
                                      func=np.log1p,
                                      inverse_func=np.expm1)

cv_score_best = cross_val_score(clf_best,
                                X[best_features].values, y, cv=3,
                                scoring='neg_mean_absolute_error')
print(cv_score_best)
# [-2.3679384  -4.97948918 -5.40380292]
print(cv_score_best.mean())
# -4.250410167013302
```
So `GridSearchCV` (best score) != `SequentialFeatureSelector` (best score) != `cross_val_score`, based on the same data / validation strategy / algo / params.

Did I miss something?
Hi @armgilles,
It looks to me like the validation strategies are not actually the same. What you have is:
- `SequentialFeatureSelector`: MAE comes from nested cross-validation within training data provided from GridSearchCV (i.e., 3-fold within 3-fold)
- `GridSearchCV`: 3-fold cross-validation with features chosen via nested cross-validation, but MAE evaluated on a model built from 2/3 of the data
- `cross_val_score`: 3-fold cross-validation with features chosen from all folds of the data
I could be mistaken about how this is actually working, but that's how it appears to me from reading your code.
Hmm, I see...

So my `data` (all my learning dataset) is first cut by `GridSearchCV` (`cv=3`):

- `data` -> `X_train_grid` (2/3) & `X_test_grid` (1/3)

Then with `SequentialFeatureSelector`, our dataset from the `GridSearchCV` split is cut again (`cv=3`):

- `X_train_grid` -> `X_train_sfs` (2/3 of `X_train_grid` from `GridSearchCV`) & `X_test_sfs` (1/3)
- Learn from `X_train_grid` and test on `X_test_grid`
- Learn from `X_train_sfs` (2/3 of `X_train_grid`) and test on `X_test_sfs`

How can I look for the best feature selection, given a specific `GridSearchCV` params list, based on a cross-validation over our whole dataset (not a subset of a subset of our data)?

Maybe with a manual loop over the `GridSearchCV` params list: fit `SequentialFeatureSelector` for each, log the result (concat all `sfs.get_metric_dict()` with the current `GridSearchCV` params), and find the best combination of `estimator.get_params().keys()` and feature selection.
I'm not 100% sure I understand the question, but I have a couple of ideas in case they are aligned with what you're trying.

1. Is your goal to determine the best features using the entire training set? I would not recommend that, but if you wanted to you could use the custom cross-validation option, and implement a custom model selection data splitter (or use a list of indices) which would provide all data as both train and test indices (see the first sketch after this list). This could over-fit your selection of features to the training data quite a bit, which will not invalidate your result (since no testing set information is leaked) but will probably reduce accuracy.
2. Is your goal instead to find the best features using the same parameters for both fitting the classifier for feature selection AND fitting the classifier for testing accuracy, to avoid grid searching once for feature selection and once for model selection? If so I don't know of a super easy way, but using the `memory` argument for the `Pipeline` (see the second sketch after this list) can save a ton of time by converting the complexity from k^2 to 2k, for k hyperparameter combinations.
3. Is your goal to find the best features by training `clf` on the training data and evaluating how good the set of features is on the testing data? If so you could simply run `sfs.fit_transform(X)`, which would run your 3-fold cross-validation, training on 2/3 of `X` and testing on the remaining 1/3 for three iterations. However, this would be sort of useless for most uses since the resulting list of features would include leaked testing set information.
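Regarding idea 1, the "list of indices" variant might look like the following. This is a minimal sketch, assuming `SFS` forwards `cv` to scikit-learn's splitting machinery (which accepts an iterable of `(train_idx, test_idx)` pairs), and it reuses the `clf` and `X` defined earlier:

```python
import numpy as np

# One "split" in which every row is used for both training and testing.
# This deliberately fits the feature selection on the full dataset.
all_idx = np.arange(len(X))
cv_all_data = [(all_idx, all_idx)]

sfs_all = SFS(clf,
              k_features=(1, X.shape[1]),
              forward=False,
              floating=True,
              scoring='neg_mean_absolute_error',
              n_jobs=-1,
              cv=cv_all_data)   # custom "cross-validation": train == test
sfs_all.fit(X.values, y)
print(sfs_all.k_feature_idx_)
```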
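Regarding idea 2, the `memory` argument is standard scikit-learn `Pipeline` caching: a fitted step is stored on disk and reused whenever it is refit with identical parameters and data. A short sketch (the cache directory name is arbitrary):

```python
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.pipeline import Pipeline

cache_dir = mkdtemp()  # any writable directory works as the cache location

pipe_cached = Pipeline(steps=[('sfs', sfs),
                              ('clf', clf)],
                       memory=cache_dir)  # cache the fitted 'sfs' step across grid-search candidates

# ... use pipe_cached inside GridSearchCV exactly like pipe_clf above ...

rmtree(cache_dir)  # remove the cache when the search is done
```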
I think it's 1. Validation strategy is an exciting area, and I fear a bad / random / unlucky / lucky test set.

To summarize my problem: find the best feature combination with a given list of params for a given estimator.
```python
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import TransformedTargetRegressor
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Create my data
RANDOM_SEED = 42
boston = load_boston()
data, y = load_boston(return_X_y=True)
X = pd.DataFrame(data, columns=boston.feature_names)

# List of params
max_depth_params = [3, 4]

# To store results from multiple SFS
features_select_report = pd.DataFrame()

# Loop on my params
for i_max_depth in max_depth_params:
    print("Features selection with max_depth : " + str(i_max_depth))

    rf = RandomForestRegressor(random_state=RANDOM_SEED,
                               n_estimators=10,
                               max_depth=i_max_depth)

    clf = TransformedTargetRegressor(regressor=rf,
                                     func=np.log1p,
                                     inverse_func=np.expm1)

    sfs = SFS(clf,
              k_features=(1, X.shape[1]),
              forward=False,
              floating=True,
              scoring='neg_mean_absolute_error',
              verbose=0,
              n_jobs=-1,
              cv=3)
    sfs.fit(X.values, y)

    # Store result
    features_select_report_temp = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
    features_select_report_temp['i_max_depth'] = i_max_depth
    features_select_report = pd.concat([features_select_report, features_select_report_temp])

features_select_report.sort_values('avg_score', ascending=0).head(10)
```
The best score is `-3.93827` with `max_depth=4` and `k_feature_idx_ = (0, 1, 2, 3, 4, 9, 12)`. `max_depth=3` is in 10th position (`-4.04314`).

With only one param (`max_depth`), it's pretty straightforward, but it can be more complex with many of them (the job of `GridSearchCV`).
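As a side note, the same manual loop could be extended to several hyperparameters without `GridSearchCV` by driving it with `sklearn.model_selection.ParameterGrid`. A sketch along those lines, reusing the objects from the snippet above (the extra `n_estimators` values are just placeholders):

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid; extend with any RandomForestRegressor parameters you like.
param_grid = {'max_depth': [3, 4],
              'n_estimators': [10, 50]}

features_select_report = pd.DataFrame()

for params in ParameterGrid(param_grid):
    rf = RandomForestRegressor(random_state=RANDOM_SEED, **params)
    clf = TransformedTargetRegressor(regressor=rf,
                                     func=np.log1p,
                                     inverse_func=np.expm1)
    sfs = SFS(clf,
              k_features=(1, X.shape[1]),
              forward=False,
              floating=True,
              scoring='neg_mean_absolute_error',
              n_jobs=-1,
              cv=3)
    sfs.fit(X.values, y)

    report = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
    for key, value in params.items():
        report[key] = value  # remember which hyperparameters produced this run
    features_select_report = pd.concat([features_select_report, report])

features_select_report.sort_values('avg_score', ascending=False).head(10)
```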
Checking the result from `SFS` with `max_depth=4`:

```python
# Best features selection for max_depth=4
# (0, 1, 2, 3, 4, 9, 12)
best_features = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'TAX', 'LSTAT']

rf = RandomForestRegressor(random_state=RANDOM_SEED,
                           n_estimators=10,
                           max_depth=4)

clf = TransformedTargetRegressor(regressor=rf,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

cv_score = cross_val_score(clf,
                           X[best_features].values, y, cv=3,
                           scoring='neg_mean_absolute_error')
print(cv_score)
# [-2.73943653 -5.219839   -3.85554654]
print(cv_score.mean())
# -3.9382740221213233
```

Same result as `SFS` for these params (good).
With `max_depth=3`:

```python
# Best features selection for max_depth=3
# (0, 1, 8, 10, 12)
best_features = ['CRIM', 'ZN', 'RAD', 'PTRATIO', 'LSTAT']

rf = RandomForestRegressor(random_state=RANDOM_SEED,
                           n_estimators=10,
                           max_depth=3)

clf = TransformedTargetRegressor(regressor=rf,
                                 func=np.log1p,
                                 inverse_func=np.expm1)

cv_score = cross_val_score(clf,
                           X[best_features].values, y, cv=3,
                           scoring='neg_mean_absolute_error')
print(cv_score)
# [-2.53444967 -5.2265473  -4.39539221]
print(cv_score.mean())
# -4.052129726807058
```

Different results with `SFS`:

- `SFS` with `max_depth=3` = `-4.04314`
- `cross_val_score` with `max_depth=3` = `-4.05213`

I don't understand why there is a difference here (pretty small, but still a difference on this toy dataset)...
My guess is there are some differences in random seeds, resulting in different cross-validation folds. If you put `np.random.seed(RANDOM_SEED)` directly before every call to `cross_val_score` and re-run everything, do the results come out the same?
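In other words, something along these lines right before each scoring call (a small sketch reusing the `clf_best` and `best_features` from the earlier snippet):

```python
np.random.seed(RANDOM_SEED)  # pin the global NumPy seed immediately before scoring
cv_score_best = cross_val_score(clf_best,
                                X[best_features].values, y, cv=3,
                                scoring='neg_mean_absolute_error')
print(cv_score_best.mean())
```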
I also tried copying and pasting your code (the first of the three snippets you posted just now) into a new Python file, ran it, and came up with `avg_score` = -4.05213 in the 10th row of the DataFrame, which differs from your DataFrame and matches your final result. That also suggests maybe the random seed got changed for the `KFold` that `cross_val_score` is doing internally, if you ran the code more than once for example.

Also, just to reiterate: these `cross_val_score` outputs are not really valid, since you are overfitting to the testing data. But that is beside the point a bit.
`cross_val_score` doesn't have a `random_state` (it seems deterministic).

I added `np.random.seed(RANDOM_SEED)` and re-ran everything multiple times. I got the same result (as in my last post).
Now using `KFold` to fix my CV split in `SFS` & `cross_val_score`:

```python
from sklearn.model_selection import KFold

my_kf = KFold(n_splits=2, random_state=RANDOM_SEED)
```

Best score is `-3.21061` with `max_depth=4` and `k_feature_idx_ = (0, 3, 4, 5, 7, 10, 12)`. The best `max_depth=3` score is now in 12th position, with a score of `-3.46979` and `k_feature_idx_ = (0, 3, 4, 5, 10, 12)`.

I can now reproduce `SFS` & `cross_val_score` (code parts 2 & 3 from my last post) (WIN)!

Could the difference come from `SFS`'s CV implementation vs. `cross_val_score` (weird)?
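The comparison code itself wasn't posted, but wiring the same fixed splitter into both calls presumably looked something like this (a sketch only; it assumes the `clf` with `max_depth=3` and the `my_kf` defined above):

```python
# Feature selection and final scoring share the exact same folds via my_kf.
sfs = SFS(clf,
          k_features=(1, X.shape[1]),
          forward=False,
          floating=True,
          scoring='neg_mean_absolute_error',
          n_jobs=-1,
          cv=my_kf)
sfs.fit(X.values, y)

selected = list(sfs.k_feature_idx_)
cv_score = cross_val_score(clf,
                           X.values[:, selected], y,
                           cv=my_kf,
                           scoring='neg_mean_absolute_error')
print(cv_score.mean())
```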
I tested the code on macOS & Linux and got some small differences in my results (due to different versions of the GCC compiler, I think).

I don't get the point about overfitting with `cross_val_score` (maybe I'm tired). With `cv=3` I learn and test on 3 different parts of my entire data (`X`); the `.mean()` of the results is better (in theory) than splitting the data into `X_train` / `X_test`, which could overfit on your `X_test` (luck, no outliers, etc.), no?

Sorry to disturb you, and thanks for your past answers; it's more a use case of reproducing results (and believing in them 👍).
Good call that `shuffle=False` in `KFold` by default. As for differences in CV implementation, however, I don't think there are many. From what I can tell in the source code, SFS uses the sklearn K-fold implementation.

Differences between OSX/Linux could also be due to differing versions of packages, like numpy or sklearn or any of the other pieces that might slightly influence the random numbers that get generated. Or it might be the compiler too, as you suggested.
The validity issue with your part 1 code is that you are trying multiple values of `max_depth` without a separate validation set to ensure that you are not picking the value based on testing set performance. With 2 values the risk is minimal, but with 100 or 1000 hyperparameter options you will surely find some that work well purely by chance in your testing data, but don't generalize well to new data. This is sort of like p-hacking in statistics. The model parameters are not themselves overfitted, but the model selection process is (the hyperparameters).

Similarly, the validity issue with parts 2 and 3 is that you have selected the best features based on how well they worked on the testing set. With lots of features you will surely find some features that work well on the testing set purely by chance, but don't generalize. To properly cross-validate, you can't use any information from the testing set -- not even how well something worked. As a consequence, if you want to try multiple sets of hyperparameters (usually a good idea), then you need nested cross-validation.
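For reference, nested cross-validation in this setting could be sketched like this: the inner `GridSearchCV` (reusing `pipe_clf` and `param_grid` from the earlier example) picks hyperparameters and features on the training folds only, and an outer `cross_val_score` measures how well that whole procedure generalizes. This is only one way to set it up:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner loop: selects hyperparameters (and, via the 'sfs' step, features)
# using only the training portion handed to it by the outer loop.
inner_search = GridSearchCV(estimator=pipe_clf,
                            param_grid=param_grid,
                            scoring='neg_mean_absolute_error',
                            cv=3,
                            refit=True,
                            n_jobs=-1)

# Outer loop: evaluates the *whole* search procedure on folds it never saw.
outer_scores = cross_val_score(inner_search, X.values, y,
                               scoring='neg_mean_absolute_error',
                               cv=3)
print(outer_scores, outer_scores.mean())
```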
Sorry for the late response here ... it's been a busy semester and I haven't had a chance to thoroughly read through this thread. However, I did some code cleanup/rewrite regarding the get_params and set_params inside the SequentialFeatureSelector (#529) which may address a potential issue there if it existed.
@armgilles Hi, Armand, I am very happy to find your post; I am facing the same problem when using `GridSearchCV` & `SequentialFeatureSelector` to find the best feature combination with a given list of params (`n_estimators` and `max_depth`) for a random forest estimator.
I noticed that you combined a for-loop and `SequentialFeatureSelector` to get the best feature combination, but I didn't get the point of your later discussion about the validity issue.
My question is: is the following process appropriate?
> *(quotes @armgilles' earlier comment above: the for-loop over `max_depth_params` with `SFS`, followed by the `cross_val_score` checks for `max_depth=4` and `max_depth=3`)*
Hi @zheng-ya!
I think you could find your answer in the SFS and GridSearch documentation.
Old thread, but I'm quite impressed with the quality of the documentation @rasbt! 🥇
Hey @zheng-ya Hm, that's weird. On my computer they actually gave the same results.
The best explanation I have for this, @zheng-ya, is that this might be a random seed issue since you are using k-fold CV. Maybe try to fix the random seed and run it again. I.e., you can try the following:
```python
from sklearn.model_selection import KFold

my_cv = KFold(n_splits=3, random_state=123, shuffle=True)

sfs = SFS(cv=my_cv, ...)
cross_val_score(cv=my_cv, ...)
```
@armgilles Thank you. I will check it and compare it with the for-loop.
@rasbt Great explanation. Thank you very much for your patience and time!
Hey @rasbt

I'm struggling to find the best features & tuning using `SequentialFeatureSelector` and `GridSearchCV`. I would like to test, for each of my `param_grid` entries, the best combination of features.

Code to reproduce:

The `max_depth` params don't change `mean_train_score` and `mean_test_score`, and I don't have the best `k_features` with `sfs`. What am I missing?