armgilles closed this issue 8 years ago
Hi Armand, wow, this is a lengthy notebook. I am happy to go over it, but maybe let me try to address your question directly first -- let me know if I understand correctly :). So, essentially, you want to determine the "best features" of a classifier separately, outside the grid search procedure? And then, you want to use these features for a specific classifier inside the EnsembleVoteClassifier? E.g., classifiers 1-3 get all features, but classifier 4 only gets feature indexes 1 & 4? I added another example to the documentation, at the very bottom of Example 3 (http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/#example-3-majority-voting-with-classifiers-trained-on-different-feature-subsets); the paragraph starting with:
Furthermore, we can fit the SequentialFeatureSelector separately, outside the grid search hyperparameter optimization pipeline ...
Is this what you were asking for? :)
(Pretty long post)
Thanks for your response & your example! Well, my problem is a bit tricky; I'll try to summarize it:
I want to feed an ensemble with two models (model_A & model_B). For each model I want to reach the best score possible (AUC), so I have to play with both feature selection (SFS) and the parameters of my models (GridSearchCV).
# Model_A
model_A = OneModel(random_state=42)
sfs_A = SequentialFeatureSelector(model_A,
                                  k_features=len(data.columns),
                                  forward=True,
                                  floating=False,
                                  scoring='roc_auc',
                                  print_progress=False,
                                  cv=5)
clfA_pipe = Pipeline([('sfs_A', sfs_A),
                      ('model_A', model_A)])
# Model_B
model_B = OtherModel(random_state=42)
sfs_B = SequentialFeatureSelector(model_B,
                                  k_features=len(data.columns),
                                  forward=True,
                                  floating=False,
                                  scoring='roc_auc',
                                  print_progress=False,
                                  cv=5)
clfB_pipe = Pipeline([('sfs_B', sfs_B),
                      ('model_B', model_B)])
# Ensemble
eclf = EnsembleVoteClassifier(clfs=[clfA_pipe, clfB_pipe,], voting='soft')
params = {'clfA_pipe__model_A__C': [1, 10, 50],
          'clfA_pipe__model_A__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
          'clfB_pipe__model_B__param1': [6, 8, 9],
          'clfB_pipe__model_B__param2': [0.1, 0.5]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5, scoring='roc_auc')
grid.fit(data, y)
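Since OneModel and OtherModel above are placeholders, here is a minimal runnable sketch of the same structure, using scikit-learn's built-in SequentialFeatureSelector and VotingClassifier as stand-ins (LogisticRegression and RandomForestClassifier are hypothetical substitutes for model_A/model_B, and iris replaces the real data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Pipeline A: feature selection + logistic regression
clfA_pipe = Pipeline([
    ('sfs', SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                      n_features_to_select=2, cv=2)),
    ('model', LogisticRegression(max_iter=1000)),
])

# Pipeline B: feature selection + random forest
clfB_pipe = Pipeline([
    ('sfs', SequentialFeatureSelector(
        RandomForestClassifier(n_estimators=5, random_state=42),
        n_features_to_select=2, cv=2)),
    ('model', RandomForestClassifier(n_estimators=5, random_state=42)),
])

# Soft-voting ensemble; the grid reaches into each pipeline via its name
eclf = VotingClassifier(estimators=[('a', clfA_pipe), ('b', clfB_pipe)],
                        voting='soft')
params = {'a__model__C': [1, 10],
          'b__model__max_depth': [2, 4]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=2)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

The key point of the structure is the same as above: the grid search tunes the downstream model's hyperparameters while each pipeline re-runs its own feature selection on every fit.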
I think this is the correct process, but I have a problem with SequentialFeatureSelector (SFS). I want to search for the best features (so I give the total number of columns of my dataset to k_features). SFS returns the selected column indexes in SFS.k_feature_idx_, but if you look into SFS.get_metric_dict() you can find a better score than SFS.k_score_ (cf. the dataset screenshot in my previous post). So SFS.k_feature_idx_ doesn't return the best feature selection; it returns the feature selection for the given k_features.
One way to do it would be to pass two parameters in my GridSearchCV params, but that would be a bit heavy...
As you mentioned, I can use ColumnSelector and "fix the problem" with SFS.k_feature_idx_.
I use:
col_sel = ColumnSelector(cols=sfs.get_metric_dict()[sorted(sfs.get_metric_dict().keys(), key=lambda x: (sfs.get_metric_dict()[x]['avg_score']), reverse=True)[0]]['feature_idx'])
Pretty ugly, but it works :)
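For readability, that one-liner can be unrolled. Here is a sketch that picks the feature_idx with the highest avg_score, using a made-up dict with the same shape as sfs.get_metric_dict() (the numbers below are illustrative, not real results):

```python
# Made-up stand-in for sfs.get_metric_dict(): keys are subset sizes k,
# values hold the selected indexes and the mean CV score at that k.
metric_dict = {
    1: {'feature_idx': (3,),          'avg_score': 0.953},
    2: {'feature_idx': (2, 3),        'avg_score': 0.960},
    3: {'feature_idx': (1, 2, 3),     'avg_score': 0.967},
    4: {'feature_idx': (0, 1, 2, 3),  'avg_score': 0.961},
}

# Pick the k whose recorded avg_score is highest, then read its indexes
best_k = max(metric_dict, key=lambda k: metric_dict[k]['avg_score'])
best_idx = metric_dict[best_k]['feature_idx']
print(best_k, best_idx)  # -> 3 (1, 2, 3)
```

Passing `ColumnSelector(cols=best_idx)` is then equivalent to the one-liner above.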
But I'm still stuck when I try to use it within the GridSearchCV process.
I updated my notebook with your advice. You can find it at:
Sorry for this long post... I hope I do not bother you too much with this.
@armgilles Sorry, somehow I got busy with other things and overlooked your update here! Unfortunately, the notebook seems to be offline now. However, let me help you tackle these issues if it is still useful.
I think it is the correct process but I have some problem with SequentialFeatureSelector (SFS). I want to search the best features (so I give the total number of columns of my dataset to k_features). SFS returns k_feature_idx_ column indexes, but if you look into SFS.get_metric_dict() you can find better than SFS.k_score_ (cf. screen of dataset in my previous post).
Hm, if you set k_features to the entire set of columns, the feature selector will return the complete dataset, since there won't be any feature selection if k == len(data.columns).
But yeah, like you said, there's at least one hacky workaround using the ColumnSelector that you mentioned... This is actually a really useful one; I will add it to the documentation if you don't mind.
Any chance you could upload the notebook again?
Hey @rasbt, don't worry about that.
I re-uploaded the notebook. I'm proud if I can help improve the documentation.
If you have questions about the notebook, don't hesitate!
So SFS.k_feature_idx_ doesn't return the best feature selection; it returns the feature selection for the given k_features.
Oh, I think I finally understand what you mean now!
Let me say this in simpler terms just to make sure I got your point :P:
Let's say we have a dataset consisting of 4 features. You set k_features to 2. Now, during the "sfs," it records that features 2, 3, 4 together result in the best performance; it returns only 2 features (e.g., 2 & 3) though, since that's what you asked for via k_features=2. However, you really want this "best" combination that was discovered during the sfs.
I think that's a cool idea and a neat tweak to the original algorithm. We could add an argument like k_features="best" so that the algorithm goes from max to min (in backward sel.) or min to max (in forward sel.) and returns the feature subset that had the best performance during this process. What do you think about it?
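To make the proposal concrete, here is a toy forward-selection loop (not mlxtend's actual implementation) that records the score at every subset size and returns whichever subset scored best overall; the cv_score function and the synthetic data are made up for illustration:

```python
import numpy as np

def cv_score(Xs, y):
    # Toy 2-fold cross-validated R^2 of an ordinary least-squares fit
    n = len(y) // 2
    halves = [(slice(0, n), slice(n, None)), (slice(n, None), slice(0, n))]
    total = 0.0
    for tr, te in halves:
        coef, *_ = np.linalg.lstsq(Xs[tr], y[tr], rcond=None)
        resid = y[te] - Xs[te] @ coef
        total += 1 - resid.var() / y[te].var()
    return total / 2

def forward_select_best(X, y, score_fn, max_k):
    """Greedy forward selection that returns the best subset of ANY size
    up to max_k (the k_features="best" idea), not just the final one."""
    selected, remaining, history = [], list(range(X.shape[1])), {}
    while remaining and len(selected) < max_k:
        # add the single feature that yields the highest score
        f, s = max(((f, score_fn(X[:, selected + [f]], y)) for f in remaining),
                   key=lambda t: t[1])
        selected.append(f)
        remaining.remove(f)
        history[len(selected)] = (tuple(selected), s)
    # return the argmax over all recorded subset sizes
    return max(history.values(), key=lambda t: t[1])

# Synthetic data: only features 1 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 1] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=120)
best_idx, best_score = forward_select_best(X, y, cv_score, max_k=5)
print(best_idx, best_score)
```

The extra bookkeeping is cheap; the expensive part (fitting every candidate subset) happens anyway during the forward pass.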
Yeah you got my point.
It is a pretty expensive computation, but it gives you the best result for a given estimator.
Sounds useful to me though. And it should be pretty straightforward to add; probably won't take more than 10-15 minutes. Will take a look at it now :)
If i can help just ping me :)
Thanks, will do, but it should really be a pretty quick thing :).
Instead of just allowing a "best" argument for the k_features param, I thought that it may be even more useful to allow a tuple. E.g., k_features=(1, 3) will consider all feature subsets that have length 1, 2, or 3, and select the best out of these. In practice:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
           k_features=(1, 3),
           forward=True,
           floating=False,
           scoring='accuracy',
           cv=5)
sfs1 = sfs1.fit(X, y)

print('best combination (ACC: %.3f): %s' % (sfs1.k_score_, sfs1.k_feature_idx_))
print('all subsets:\n', sfs1.subsets_)
best combination (ACC: 0.967): (1, 2, 3)
all subsets:
{1: {'feature_idx': (3,), 'cv_scores': array([ 0.96666667, 0.93333333, 0.93333333, 0.93333333, 1. ]), 'avg_score': 0.95333333333333337}, 2: {'feature_idx': (2, 3), 'cv_scores': array([ 0.96666667, 0.96666667, 0.96666667, 0.93333333, 0.96666667]), 'avg_score': 0.95999999999999996}, 3: {'feature_idx': (1, 2, 3), 'cv_scores': array([ 0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667]), 'avg_score': 0.96666666666666656}}
What do you think? :)
Yeah, that's great!
Maybe find another dataset for the documentation; I'm tired of the Iris dataset (too few features). Maybe use the Boston dataset.
Haha yeah, why not :). I just use this iris thing since it's lean and fast and doesn't require scaling for KNN (all cm), but yeah, it's booooring ;)
I am using SFS and selecting 10 features, but the algorithm is so slow. Any suggestions, please?
Hm, what are your cv and n_jobs settings?
Hi sir, thank you for your response. Please see below: 1: RandomForestClassifier(n_jobs=-1, n_estimators=5) 2: Dataset is load_breast_cancer. 3: cv=5 4: n_jobs=-1
Hm. How do you define slow then? It shouldn't take more than a few seconds or minutes. Also, I would recommend setting only one of the n_jobs to -1, not both.
E.g.,
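The example got cut off here; one way to set this up (a sketch using scikit-learn's SequentialFeatureSelector, since the same n_jobs interplay applies to the mlxtend one; the parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_breast_cancer(return_X_y=True)

# Parallelize in ONE place only: keep the estimator single-threaded
# (n_jobs=1) and let the selector fan out over candidate features/folds.
rf = RandomForestClassifier(n_estimators=5, n_jobs=1, random_state=42)
sfs = SequentialFeatureSelector(rf, n_features_to_select=3, cv=2, n_jobs=-1)
sfs.fit(X, y)
print(sfs.get_support(indices=True))
```

Setting n_jobs=-1 in both places makes the parallel workers compete for the same cores, which is often slower than parallelizing in one place.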
Thank you, Sir, for your kind response. It works faster with n_jobs=1.
Glad to hear!
Hi Sebastian,
I posted an issue after some tweets with you (I hope it could help other people).
I would like to perform Sequential Feature Selection (SFS) within a Pipeline. But at the end of the process, SFS keeps SFS.k_features features (25 for this example). The score clf1_pipe.named_steps['sfs1'].k_score_ is 0.6956081, but it is not the best score (performance) we got; in fact, we have a better score with 10 features. Can we automatically get, during the Pipeline process, the feature selection corresponding to the best performance?
You can find the notebook with the pipeline process SFS ("Using Pipeline to do it").
Edit: I manually searched for the best number of k_features in SFS for all my estimators. Then I plugged them into an EnsembleVoteClassifier. The result is not what I expected (see: "Find Manually the best k_features for SFS and fit our ensemble").