rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Replace `k_features` in `SequentialFeatureSelector` by `feature_range` and `recipe` #261

Open · rasbt opened 6 years ago

rasbt commented 6 years ago

For ease of use and versatility, the k_features parameter should be replaced by two new parameters, feature_range and recipe.

Regarding feature_range, if "all" is selected, feature subsets of all sizes will be considered as candidates for the best feature subset selected based on what's specified under recipe.

Regarding recipe, if "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided as an argument, the smallest feature subset that is within one standard error of the best cross-validation performance will be selected.

I.e., if feature_range=(3, 5) and recipe='best', the feature subset with the best performance will be selected, and this subset can have either 3, 4, or 5 features.

Note that it would be best to deprecate k_features and default it to None. However, if k_features is not None, it should take priority over the new parameters to avoid breaking existing code bases.
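For illustration, here is a rough sketch of how the 'parsimonious' rule could be emulated post hoc on top of the current API. The thresholding logic below is only a sketch of the proposal, not existing behavior; get_metric_dict and its 'avg_score' and 'std_err' entries are existing functionality:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

np.random.seed(123)
X = np.random.random((100, 10))
y = (np.arange(100) >= 50).astype(int)

sfs = SFS(KNeighborsClassifier(n_neighbors=3),
          k_features=(3, 5), forward=True, floating=False,
          scoring='accuracy', cv=5).fit(X, y)

# Emulate feature_range=(3, 5) and recipe='parsimonious': among the
# candidate sizes, pick the smallest subset whose average CV score is
# within one standard error of the best average CV score.
metrics = sfs.get_metric_dict()
feature_range = (3, 5)
candidates = [k for k in metrics
              if feature_range[0] <= k <= feature_range[1]]
best_k = max(candidates, key=lambda k: metrics[k]['avg_score'])
threshold = metrics[best_k]['avg_score'] - metrics[best_k]['std_err']
parsimonious_k = min(k for k in candidates
                     if metrics[k]['avg_score'] >= threshold)
print('Parsimonious subset:', metrics[parsimonious_k]['feature_idx'])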

phsheth commented 5 years ago

I have a dataset with 54 features. I am observing that using k_features=(25, 30) evaluates the model on subsets all the way up to all 54 features. I am not logging a separate issue at the moment.

You can see it here if you have access:

https://www.kaggle.com/phsheth/ensemble-sequential-backward-selection?scriptVersionId=20009920

https://www.kaggleusercontent.com/kf/20009920/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..vycxHb6-clsD7gXxIahuMA.9w0WyTfCh3sk2L-MkLCfIQOR5LIF-Hd_mSo5ivqZT2Pv556biCiHi7dRiaJL4rlXjwFFyboAF-vLrSU98hBJbeCiaWY7v0DqwIjnHg-51CSfgQe4Djy_MwHTdLQ4FgtTatJXG83GLuoK_8mDx4j0FVas4ZoxA7YperBIBjiuaLA.g1QNBOoCLpFT-NEBXL800g/__results___files/__results___16_2.png

rasbt commented 5 years ago

Hm, that's weird and shouldn't happen. I just ran a quick example and couldn't reproduce the issue:

E.g., for backward selection:

from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import numpy as np

np.random.seed(123)

X = np.random.random((100, 50))
y = np.zeros((100)).astype(int)
y[50:] = 1

knn = KNeighborsClassifier(n_neighbors=3)

sfs1 = SFS(knn, 
           k_features=(20, 30), 
           forward=False, 
           floating=False, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X, y)

print('Size of best selected subset:', len(sfs1.k_feature_idx_))
print('All feature subset sizes:', sfs1.subsets_.keys())

it returns:

Size of best selected subset: 26
All feature subset sizes: dict_keys([50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20])

And for forward selection, it also seems fine:

sfs1 = SFS(knn, 
           k_features=(20, 30), 
           forward=True, 
           floating=True, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X, y)

print('Size of best selected subset:', len(sfs1.k_feature_idx_))
print('All feature subset sizes:', sfs1.subsets_.keys())

it returns:

Size of best selected subset: 23
All feature subset sizes: dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])

I tried the floating variants as well, and there was no issue. How did you run the SFS exactly, and which version of mlxtend are you using?

You can check via

import mlxtend
mlxtend.__version__

phsheth commented 5 years ago

Using version '0.17.0' (it imports directly in the Kaggle kernel; I did not have to add the mlxtend library there).

# classifier_rf is a RandomForestClassifier defined earlier in the notebook
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

seqbacksel_rf = SFS(classifier_rf, k_features=(25, 30),
                    forward=False, floating=False,
                    scoring='accuracy', cv=5,
                    n_jobs=-1)
seqbacksel_rf = seqbacksel_rf.fit(train_X, train_y.values.ravel())

print('best combination (ACC: %.3f): %s\n' % (seqbacksel_rf.k_score_, seqbacksel_rf.k_feature_idx_))
print('all subsets:\n', seqbacksel_rf.subsets_)
plot_sfs(seqbacksel_rf.get_metric_dict(), kind='std_err');
/opt/conda/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning

best combination (ACC: 0.886): (0, 3, 4, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 19, 23, 27, 29, 30, 31, 35, 36, 42, 43, 44, 45, 46, 50, 51)

all subsets:
...

[attached plot: sfs_gh, the plot_sfs output]

rasbt commented 5 years ago

I can't spot an issue in the example above; it all looks fine to me. So, based on the plot above, I would expect the SFS to return a subset of size 29, is this correct? You can double-check via the following code:

print('Size of best selected subset:', len(seqbacksel_rf.k_feature_idx_))

(it should print 29, which would then be within the k_features = (25, 30) range you specified)

phsheth commented 5 years ago

OK, sir, my apologies. I thought this meant mlxtend should not evaluate the model using more than 30 features. It does evaluate such subsets, but it does not report a subset above 30 features.

rasbt commented 5 years ago

Oh, maybe this was a misunderstanding then.

Say you set k_features=(25, 30). In backward selection, the algorithm starts from the full feature set and removes one feature at a time until it reaches the lower bound, so subsets larger than 30 are necessarily evaluated along the way. However, only subsets with 25 to 30 features are considered as candidates for the best subset, which is what k_feature_idx_ reports.
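To make this concrete, here is a minimal self-contained sketch (with synthetic data standing in for your 54-feature dataset) showing that all subset sizes from 54 down to 25 are evaluated, while the reported subset stays within the requested range:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

np.random.seed(123)
X = np.random.random((100, 54))
y = (np.arange(100) >= 50).astype(int)

sfs = SFS(KNeighborsClassifier(n_neighbors=3),
          k_features=(25, 30), forward=False, floating=False,
          scoring='accuracy', cv=0).fit(X, y)

print('Evaluated sizes:', sorted(sfs.subsets_.keys()))  # 25 ... 54
print('Reported size:', len(sfs.k_feature_idx_))        # within 25-30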

Hope this addresses the issue!?

phsheth commented 5 years ago

I understand now, sir. Thanks for clarifying. I did read about backward selection, but the fundamentals got lost somewhere along the way.

rasbt commented 5 years ago

No worries, and I am glad to hear that there's no bug :)