rasbt opened 6 years ago
I have a dataset with 54 features. I am observing that with k_features = (25, 30), the model is still evaluated on feature subsets all the way up to all 54 features. Not logging a separate issue at the moment.
You can see here if you can get access:
https://www.kaggle.com/phsheth/ensemble-sequential-backward-selection?scriptVersionId=20009920
Hm that's weird and shouldn't happen. I just ran a quick example and couldn't reproduce this issue:
E.g., for backward selection:
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import numpy as np
np.random.seed(123)
X = np.random.random((100, 50))
y = np.zeros((100)).astype(int)
y[50:] = 1
knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
           k_features=(20, 30),
           forward=False,
           floating=False,
           scoring='accuracy',
           cv=0)
sfs1 = sfs1.fit(X, y)
print('Size of best selected subset:', len(sfs1.k_feature_idx_))
print('All feature subset sizes:', sfs1.subsets_.keys())
it returns:
Size of best selected subset: 26
All feature subset sizes: dict_keys([50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20])
And for forward selection, it also seems fine:
sfs1 = SFS(knn,
           k_features=(20, 30),
           forward=True,
           floating=True,
           scoring='accuracy',
           cv=0)
sfs1 = sfs1.fit(X, y)
print('Size of best selected subset:', len(sfs1.k_feature_idx_))
print('All feature subset sizes:', sfs1.subsets_.keys())
returns
Size of best selected subset: 23
All feature subset sizes: dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
I tried the floating variants as well, and there was no issue. How exactly did you run the SFS, and which version of mlxtend are you using?
You can check via
import mlxtend
mlxtend.__version__
Using version '0.17.0' (it imports directly on the Kaggle kernel; I did not have to add the mlxtend library there).
seqbacksel_rf = SFS(classifier_rf, k_features=(25, 30),
                    forward=False, floating=False,
                    scoring='accuracy', cv=5,
                    n_jobs=-1)
seqbacksel_rf = seqbacksel_rf.fit(train_X, train_y.values.ravel())
print('best combination (ACC: %.3f): %s\n' % (seqbacksel_rf.k_score_, seqbacksel_rf.k_feature_idx_))
print('all subsets:\n', seqbacksel_rf.subsets_)
plot_sfs(seqbacksel_rf.get_metric_dict(), kind='std_err');
/opt/conda/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
"timeout or by a memory leak.", UserWarning
best combination (ACC: 0.886): (0, 3, 4, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 19, 23, 27, 29, 30, 31, 35, 36, 42, 43, 44, 45, 46, 50, 51)
all subsets:
...
I can't spot an issue in the example above; it all looks fine to me. So, based on the plot above, I would expect the SFS to return a subset of size 29. Is this correct? You can double-check via the following code:
print('Size of best selected subset:', len(seqbacksel_rf.k_feature_idx_))
(it should print 29, which would then be within the k_features = (25, 30) range you specified)
Ok sir, my apologies. I thought k_features = (25, 30) meant mlxtend should not evaluate the model on subsets with more than 30 features. It does evaluate them, but it does not report a subset above 30 features.
Oh, maybe this was a misunderstanding then. Say you set k_features=(25, 30). With backward selection, the algorithm starts from the full feature set and removes one feature at a time, so it necessarily evaluates subsets larger than 30 on its way down; the (25, 30) range only restricts which subset sizes are eligible to be reported as the best subset. Hope this addresses the issue!?
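To make the mechanics concrete, here is a toy sketch of greedy backward elimination (an illustration only, not mlxtend's actual implementation, and with a made-up scoring function): starting from all features, one feature is dropped per step, so every subset size between the full set and the lower bound gets evaluated, while the reported best subset is restricted to the requested size range.

```python
import numpy as np

def backward_select(score_fn, n_features, k_range):
    """Toy greedy backward elimination (illustration only).

    score_fn(subset) -> float; higher is better.
    k_range = (k_min, k_max): sizes eligible for the *reported* best subset.
    Returns (best_subset, sorted list of all subset sizes that were evaluated).
    """
    k_min, k_max = k_range
    subset = tuple(range(n_features))
    visited = {len(subset): (subset, score_fn(subset))}
    while len(subset) > k_min:
        # Try removing each remaining feature; keep the best resulting subset.
        candidates = [tuple(f for f in subset if f != drop) for drop in subset]
        subset = max(candidates, key=score_fn)
        visited[len(subset)] = (subset, score_fn(subset))
    # Only sizes within [k_min, k_max] are eligible to be reported as "best",
    # even though larger subsets were evaluated along the way.
    eligible = {k: v for k, v in visited.items() if k_min <= k <= k_max}
    best_subset, _ = max(eligible.values(), key=lambda sv: sv[1])
    return best_subset, sorted(visited)

# Hypothetical score: reward overlap with a "useful" feature set,
# with a small penalty for subset size.
rng = np.random.default_rng(0)
useful = set(rng.choice(54, size=27, replace=False).tolist())
score = lambda s: len(useful & set(s)) - 0.1 * len(s)

best, sizes = backward_select(score, n_features=54, k_range=(25, 30))
print(sizes[0], sizes[-1])  # 25 54 -> every size from 25 to 54 was evaluated
print(len(best))            # size of the reported best subset is within [25, 30]
```

This mirrors what the SFS run above showed: subsets_.keys() contains sizes above the k_features upper bound, but k_feature_idx_ always lies within the requested range.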
I understand now, sir. Thanks for clarifying. I did read about backward selection, but the fundamentals got lost somewhere along the way.
No worries, and I am glad to hear that there's no bug :)
For ease of use and versatility, the k_features parameter should be changed to two new parameters, feature_range and recipe.
Regarding feature_range: if "all" is selected, feature subsets of all sizes will be considered as candidates for the best feature subset, selected based on what's specified under recipe.
Regarding recipe: if "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided as an argument, the smallest feature subset that is within one standard error of the best cross-validation performance will be selected.
I.e., if feature_range=(3, 5) and recipe='best', the feature subset with the best performance will be selected, and this feature subset can have either 3, 4, or 5 features.
Note that it would be best to deprecate k_features and default it to None. However, if k_features is not None, it should take priority over the new parameters to avoid breaking existing code bases.
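The "parsimonious" recipe described above could be sketched as follows. This is a hypothetical helper, not mlxtend's API; it assumes the per-fold CV scores for each subset size are available (as in the cv_scores entries of SFS's subsets_ attribute), and the example numbers are made up:

```python
import numpy as np

def parsimonious_subset(subsets):
    """Pick the smallest subset size whose mean CV score is within one
    standard error of the best mean CV score (sketch of recipe='parsimonious').

    subsets: dict mapping subset size -> list of CV fold scores.
    """
    stats = {}
    for size, scores in subsets.items():
        scores = np.asarray(scores, dtype=float)
        # Mean and standard error of the mean across CV folds.
        stats[size] = (scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores)))
    best_size = max(stats, key=lambda k: stats[k][0])
    best_mean, best_se = stats[best_size]
    threshold = best_mean - best_se
    # Smallest size whose mean score clears the one-standard-error threshold.
    return min(size for size, (mean, _) in stats.items() if mean >= threshold)

# Hypothetical CV scores for subset sizes 3, 4, and 5:
cv_scores = {
    3: [0.90, 0.92, 0.91, 0.89, 0.90],  # mean 0.904
    4: [0.92, 0.93, 0.92, 0.90, 0.92],  # mean 0.918, within one SE of best
    5: [0.92, 0.93, 0.92, 0.91, 0.92],  # mean 0.920, best
}
print(parsimonious_subset(cv_scores))  # -> 4
```

With recipe='best' the selector would simply return the size-5 subset here; 'parsimonious' trades a statistically insignificant amount of performance for a smaller model.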