rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Take the best performance during a Sequential Feature Selector with Pipeline process #41

Closed armgilles closed 8 years ago

armgilles commented 8 years ago

Hi Sebastian,

I'm posting this issue after a Twitter exchange with you (I hope it can help other people).

I would like to perform Sequential Feature Selection (SFS) within a Pipeline. But at the end of the process, SFS keeps exactly SFS.k_features features (25 in this example):

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf1 = LogisticRegression(class_weight='balanced', solver='newton-cg',
                          C=100.0, random_state=17)

sfs1 = SFS(clf1,
           k_features=25,
           forward=True,
           floating=False,
           scoring='roc_auc',
           cv=5)
sfs1 = sfs1.fit(data.values, y.values)

clf1_pipe = Pipeline([('sfs1', sfs1),
                      ('Logistic Newton', clf1)])

print(clf1_pipe.named_steps['sfs1'].k_feature_idx_)
# (0, 1, 3, 4, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 23, 25, 27, 29, 30, 31, 34, 35)

The score clf1_pipe.named_steps['sfs1'].k_score_ is 0.6956081, but it is not the best score (performance) we got. In fact, we get a better score with 10 features:

metric_dict = clf1_pipe.named_steps['sfs1'].get_metric_dict(confidence_interval=0.90)
result_clf1_pipe = pd.DataFrame.from_dict(metric_dict).T
result_clf1_pipe.sort_values('avg_score', ascending=False, inplace=True)
result_clf1_pipe.head()

[screenshot: get_metric_dict() DataFrame sorted by avg_score, showing a better score at 10 features]
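For reference, a minimal sketch of pulling the best subset out of that sorted DataFrame by hand (assuming the descending avg_score sort above):

# Take the feature indices of the best-scoring row and subset the data
best_idx = result_clf1_pipe.iloc[0]['feature_idx']
X_best = data.values[:, list(best_idx)]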

Can we automatically get, during the Pipeline process, the feature selection corresponding to the best performance?

You can find the notebook with the pipeline process SFS ("Using Pipeline to do it").

Edit: I manually searched for the best number of k_features for SFS for all my estimators. Then I plugged them into an EnsembleVoteClassifier. The result is not what I expected (see: "Find Manually the best k_features for SFS and fit our ensemble").

rasbt commented 8 years ago

Hi, Armand, wow, this is a lengthy notebook. I'm happy to go over it, but maybe let me try to address your question directly first -- let me know if I understand correctly :). So, essentially, you want to determine the "best features" of a classifier separately, outside the grid search procedure? And then you want to use these features for a specific classifier inside the EnsembleVoteClassifier? E.g., classifiers 1-3 get all features, but classifier 4 only gets feature indexes 1 & 4? I added another example to the documentation, at the very bottom of Example 3 (http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/#example-3-majority-voting-with-classifiers-trained-on-different-feature-subsets); the paragraph starting with:

Furthermore, we can fit the SequentialFeatureSelector separately, outside the grid search hyperparameter optimization pipeline ...

Is this what you were asking for? :)
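For concreteness, here is a minimal sketch of that approach, loosely following Example 3 (the dataset and estimators are illustrative stand-ins, not the ones from the notebook):

from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.feature_selection import ColumnSelector
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# 1) Run the feature selection separately, outside any grid search
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SFS(knn, k_features=2, forward=True, scoring='accuracy', cv=5)
sfs = sfs.fit(X, y)

# 2) Feed only the selected columns to that one ensemble member
knn_on_subset = Pipeline([('cols', ColumnSelector(cols=sfs.k_feature_idx_)),
                          ('knn', knn)])

# 3) The other ensemble member still sees all features
eclf = EnsembleVoteClassifier(clfs=[LogisticRegression(), knn_on_subset],
                              voting='soft')
eclf = eclf.fit(X, y)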

armgilles commented 8 years ago

(Pretty long post)

Thanks for your response & your example! Well, my problem is a bit tricky; I'll try to summarize it:

I want to feed an ensemble with 2 models (model_A & model_B). For each of my models I want to reach the best possible score (AUC). So I have to play with feature selection (SFS) and with the parameters of my models (GridSearchCV).

# OneModel / OtherModel are placeholders for two concrete estimators
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Model_A
model_A = OneModel(random_state=42)
sfs_A = SequentialFeatureSelector(model_A, 
                                 k_features=len(data.columns),
                                 forward=True, 
                                 floating=False, 
                                 scoring='roc_auc',
                                 print_progress=False,
                                 cv=5)

clfA_pipe = Pipeline([('sfs_A', sfs_A),
                      ('model_A', model_A)])
# Model_B
model_B = OtherModel(random_state=42)
sfs_B = SequentialFeatureSelector(model_B, 
                                 k_features=len(data.columns),
                                 forward=True, 
                                 floating=False, 
                                 scoring='roc_auc',
                                 print_progress=False,
                                 cv=5)

clfB_pipe = Pipeline([('sfs_B', sfs_B),
                      ('model_B', model_B)])
# Ensemble
eclf = EnsembleVoteClassifier(clfs=[clfA_pipe, clfB_pipe,], voting='soft')

params = {'clfA_pipe__model_A__C': [1, 10, 50],
          'clfA_pipe__model_A__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
          'clfB_pipe__model_B__param1': [6, 8, 9],
          'clfB_pipe__model_B__param2': [0.1, 0.5]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5, scoring='roc_auc')
grid.fit(data, y)

I think this is the correct process, but I have a problem with SequentialFeatureSelector (SFS). I want to search for the best features (so I give the total number of columns of my dataset as k_features). SFS returns column indexes in SFS.k_feature_idx_, but if you look into SFS.get_metric_dict() you can find a better score than SFS.k_score_ (cf. the screenshot in my previous post).

So SFS.k_feature_idx_ doesn't return the best feature selection; it returns the feature selection for exactly k_features features.

One way to do it would be to pass the k_features of each SFS as two extra parameters in my GridSearchCV params. But that would be a bit heavy...
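Something along these lines, sketched for a single pipeline (the value lists are illustrative):

# Hedged sketch: tune k_features directly inside the grid search.
# Each candidate value re-runs the whole sequential search, hence "heavy".
param_grid = {'sfs_A__k_features': [5, 10, 15, 20],
              'model_A__C': [1, 10, 50]}
grid_A = GridSearchCV(estimator=clfA_pipe, param_grid=param_grid,
                      cv=5, scoring='roc_auc')
grid_A = grid_A.fit(data.values, y.values)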

As you mentioned, I can use ColumnSelector to "fix the problem" with SFS.k_feature_idx_. I use:

col_sel = ColumnSelector(cols=sfs.get_metric_dict()[sorted(sfs.get_metric_dict().keys(), key=lambda x: (sfs.get_metric_dict()[x]['avg_score']), reverse=True)[0]]['feature_idx'])

Pretty ugly but it works :)
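A more readable equivalent of that one-liner (same logic, assuming the fitted sfs from above):

# Pick the subset size with the highest avg_score, then read its indices
metric_dict = sfs.get_metric_dict()
best_k = max(metric_dict, key=lambda k: metric_dict[k]['avg_score'])
col_sel = ColumnSelector(cols=metric_dict[best_k]['feature_idx'])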

But I'm still stuck when I try to use it within the GridSearchCV process.

I updated my notebook with your advice. You can find it at:

Sorry for this long post... I hope I'm not bothering you too much with this.

rasbt commented 8 years ago

@armgilles Sorry, somehow I got busy with other things and overlooked your update here! Unfortunately, the notebook is offline now, I guess. However, let me help you tackle these issues if it's still useful.

I think this is the correct process, but I have a problem with SequentialFeatureSelector (SFS). I want to search for the best features (so I give the total number of columns of my dataset as k_features). SFS returns column indexes in SFS.k_feature_idx_, but if you look into SFS.get_metric_dict() you can find a better score than SFS.k_score_ (cf. the screenshot in my previous post).

Hm, if you set k_features to the entire set of columns, the feature selector will return the complete dataset, since there won't be any feature selection if k == len(data.columns).

But yeah, like you said, there's at least one hacky workaround using the ColumnSelector that you mentioned ... This is actually a really useful one; I will add it to the documentation if you don't mind.

Any chance you could upload the notebook again?

armgilles commented 8 years ago

Hey @rasbt, don't worry about that.

I re-uploaded the notebook. I'd be proud to help improve the documentation.

If you have questions about the notebook, don't hesitate to ask!

rasbt commented 8 years ago

So SFS.k_feature_idx_ doesn't return the best feature selection; it returns the feature selection for exactly k_features features.

Oh, I think I finally understand what you mean now!

Let me say this in simpler terms just to make sure I got your point :P:

Let's say we have a dataset consisting of 4 features. You set k_features to 2. Now, during the "sfs," it records that features 2, 3, 4 together result in the best performance; it returns only 2 features (e.g., 2 & 3) though, since that's what you asked for via k_features=2. However, you really want this "best" combination that was discovered during the sfs.

I think that's a cool idea and a neat tweak to the original algorithm. We could add an argument like k_features="best" so that the algorithm goes from max to min (in backward selection) or min to max (in forward selection) and returns the feature subset that had the best performance during this process. What do you think about it?

armgilles commented 8 years ago

Yeah, you got my point.

It is a pretty expensive computation, but it gives you the best result for a given estimator.

rasbt commented 8 years ago

Sounds useful to me, though. And it should be pretty straightforward to add; probably won't take more than 10-15 minutes. Will take a look at it now :)

armgilles commented 8 years ago

If i can help just ping me :)

rasbt commented 8 years ago

Thanks, will do, but it should really be a pretty quick thing :).

rasbt commented 8 years ago

Instead of just allowing a "best" argument for the k_features param, I thought that it may be even more useful to accept a tuple. E.g., k_features=(1, 3) will consider all feature subsets that have length 1, 2, or 3, and select the best of these. In practice:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3)

sfs1 = SFS(knn,
           k_features=(1, 3),
           forward=True,
           floating=False,
           scoring='accuracy',
           cv=0)
sfs1 = sfs1.fit(X, y)

print('best combination (ACC: %.3f): %s' % (sfs1.k_score_, sfs1.k_feature_idx_))
print('all subsets:\n', sfs1.subsets_)

best combination (ACC: 0.967): (1, 2, 3)
all subsets:
 {1: {'feature_idx': (3,),
      'cv_scores': array([ 0.96666667,  0.93333333,  0.93333333,  0.93333333,  1.        ]),
      'avg_score': 0.95333333333333337},
  2: {'feature_idx': (2, 3),
      'cv_scores': array([ 0.96666667,  0.96666667,  0.96666667,  0.93333333,  0.96666667]),
      'avg_score': 0.95999999999999996},
  3: {'feature_idx': (1, 2, 3),
      'cv_scores': array([ 0.96666667,  0.96666667,  0.96666667,  0.96666667,  0.96666667]),
      'avg_score': 0.96666666666666656}}

What do you think? :)

armgilles commented 8 years ago

Yeah, that's great!

Maybe find another dataset for the documentation; I'm tired of the Iris dataset (too few features). Maybe use the Boston dataset.

rasbt commented 8 years ago

Haha yeah, why not :). I just use this iris thing since it's small and fast and doesn't require scaling in KNN (all cm), but yeah, it's booooring ;)

vickykhan89 commented 4 years ago

I am using SFS, and the number of selected features is 10, but the algorithm is so slow. Any suggestions, please?

rasbt commented 4 years ago

Hm, it's hard to say without more details. Which estimator and dataset are you using, and what are your cv and n_jobs settings?

vickykhan89 commented 4 years ago

Hi sir, thank you for your response. Please see below:

1. RandomForestClassifier(n_jobs=-1, n_estimators=5)
2. Dataset: load_breast_cancer
3. cv=5
4. n_jobs=-1

rasbt commented 4 years ago

Hm. How do you define slow, then? It shouldn't take more than a few seconds or minutes. Also, I would recommend setting only one of the n_jobs to -1, not both.

E.g., parallelize in only one of the two places.
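A minimal sketch of that advice (assuming, based on the follow-up reply, that the classifier's n_jobs is the one to set to 1):

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Keep the estimator serial ...
forest = RandomForestClassifier(n_estimators=5, n_jobs=1)

# ... and let the feature selector parallelize over candidate subsets
sfs = SFS(forest, k_features=10, scoring='accuracy', cv=5, n_jobs=-1)
sfs = sfs.fit(X, y)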

vickykhan89 commented 4 years ago

Thank you, sir, for your kind response. It works faster with n_jobs=1.

rasbt commented 4 years ago

Glad to hear!