Hi, @hmf . Sorry for the silence, just got back from SciPy & Texas; I will try to take a closer look at it on the weekend!
Thank you.
Hi, Hugo, hm, I am not completely sure why you are getting this error, but I think it may be related to passing pandas DataFrames somewhere. I.e., to run your example code, I loaded the Iris dataset and had no issues with that. So, may I suggest just using something like
X, y = X.values, y.values
at the beginning of your code and adjust the following lines, e.g.,:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
from sklearn.datasets import load_iris
# get the unnormalized data
#X = dy[ dy.columns.difference(['label']).values ]
#y = dy['label'].values
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
V = X_train.var()#.values
C = np.cov(X_train)#.values
CPI = np.linalg.pinv(C)
CI = np.linalg.inv(C)
# k_range : must be less than the training size. What happens if number of features > sample size
k_range = range(1, X.shape[1])
weights = ['uniform' , 'distance']
#algos_all = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all = ['ball_tree', 'kd_tree', 'brute']
algos = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]
# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:
seuclidean = {
'sfs__k_features' : list(range(1, X.shape[1])),
'sfs__estimator__metric' : ['seuclidean'],
'sfs__estimator__metric_params': [ {'V':V} ], # will be automatically calculated
'sfs__estimator__algorithm' : ['ball_tree'], # TODO , ['brute', 'ball_tree'],
'sfs__estimator__n_neighbors' : list(k_range),
'sfs__estimator__weights' : weights,
'sfs__estimator__leaf_size' : list(leaf_sizes) }
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())
sfs1 = SFS(estimator=knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
print_progress=False,
cv=5)
# !?!? n_jobs=-1)
pipe = Pipeline([
('standardize', preprocessing.MinMaxScaler()),
('sfs', sfs1),
('knn', knn)])
# See KNeighborsClassifier equivalent param_grid
param_grid = [
seuclidean
]
# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
#n_jobs=-1, for better stack tracing
cv=5,
verbose=1,
refit=True)
# Run the grid search
gs = gs.fit(X_train, y_train) #.values, y_train)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 2.5s
[Parallel(n_jobs=1)]: Done 199 tasks | elapsed: 10.4s
[Parallel(n_jobs=1)]: Done 449 tasks | elapsed: 25.2s
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed: 30.3s finished
I am closing this issue since it doesn't seem to be an mlxtend code-related issue, but please feel free to comment further on it :)
Cheers, Sebastian
Appreciate you looking into it. I copied and pasted your code into a cell to make sure I did not mess up. Unfortunately now I get:
[I 08:40:56.355 NotebookApp] KernelRestarter: restarting kernel (1/5)
WARNING:root:kernel ca73f90f-7cd8-4c45-8d93-c146f8d30738 restarted
[I 08:41:09.265 NotebookApp] Kernel shutdown: ca73f90f-7cd8-4c45-8d93-c146f8d30738
In addition to this I cannot reproduce my previous results. I usually update my virtualenv regularly, so it may be due to a change in some other package. I will have to look at this more carefully.
Apologies for the "noise".
Regards, HF
Hi Sebastian,
Something is definitely fishy here. Upon further investigation I found that the calculation of the variance using the var function is not the same for pandas' DataFrame and numpy's array. More specifically, if in numpy you do not define the axis, then the variance is computed over the flattened matrix. This should result in an exception, because the Iris dataset has 4 columns, so the variance V should be an array of length 4. However, I get a kernel failure when using the ball_tree algorithm (so I have done the tests here using the brute algorithm). To see the difference in the variance calculations you can use the code below:
import pandas as pd
print(X_train.var())
print()
print(pd.DataFrame(X_train).var())
print(X_train.var(axis=0))
And the output should be something like:
3.8504924375

0    0.692489
1    0.168905
2    3.101385
3    0.607257
dtype: float64
[ 0.685564  0.167216  3.070371  0.601184]
So I reused the code you had. I now have 2 metric configurations: sfs_seuclidean and seuclidean. We use the pipe to send the data and parameters through the SFS with the first configuration, or let them go directly to the KNN via the second configuration. The example (see code below) has the pipe with the SFS commented out and the param_grid set to the second configuration. I get the following output:
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 0.1s
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed: 0.3s finished
If, however, I use the SFS configuration (uncomment the sfs step and use the alternate configuration), then I get an exception (see end of message for the full stack trace):
/home/hmf/my_py3/lib/python3.4/site-packages/scipy/spatial/distance.py in cdist(XA, XB, metric, p, V, VI, w)
2151 'one-dimensional.')
2152 if V.shape[0] != n:
-> 2153 raise ValueError('Variance vector V must be of the same '
2154 'dimension as the vectors on which the '
2155 'distances are computed.')
ValueError: Variance vector V must be of the same dimension as the vectors on which the distances are computed.
I am using version 0.4.1 of mlxtend via a virtualenv installation. Could you use the code below to check if the diagnosis above is correct?
TIA, Hugo
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
# get the unormalized data
#X = dy[ dy.columns.difference(['label']).values ]
#y = dy['label'].values
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
#V = X_train.var()#.values
V = X_train.var(axis=0)
# k_range : must be less than the training size. What happens if number of features > sample size
k_range = range(1, X.shape[1])
weights = ['uniform' , 'distance']
#algos_all = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all = ['ball_tree', 'kd_tree', 'brute']
algos = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]
# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:
sfs_seuclidean = {
'sfs__k_features' : list(range(1, X.shape[1])),
'sfs__estimator__metric' : ['seuclidean'],
'sfs__estimator__metric_params': [ {'V':V} ], # will be automatically calculated
#'sfs__estimator__algorithm' : ['ball_tree'] #, # TODO , ['brute', 'ball_tree'],
'sfs__estimator__algorithm' : ['brute'], #, # TODO , ['brute', 'ball_tree'],
'sfs__estimator__n_neighbors' : list(k_range),
'sfs__estimator__weights' : weights,
'sfs__estimator__leaf_size' : list(leaf_sizes)
}
seuclidean = {
'knn__metric' : ['seuclidean'],
'knn__metric_params': [ {'V':V} ], # will be automatically calculated
#'knn__algorithm' : ['ball_tree'] #, # TODO , ['brute', 'ball_tree'],
'knn__algorithm' : ['brute'], #, # TODO , ['brute', 'ball_tree'],
'knn__n_neighbors' : list(k_range),
'knn__weights' : weights,
'knn__leaf_size' : list(leaf_sizes)
}
# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())
sfs1 = SFS(estimator=knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
print_progress=False,
cv=5)
# !?!? n_jobs=-1)
pipe = Pipeline([
('standardize', preprocessing.MinMaxScaler()),
# ('sfs', sfs1),
('knn', knn)])
# See KNeighborsClassifier equivalent param_grid
param_grid = [
seuclidean
#sfs_seuclidean
]
# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
#n_jobs=-1, for better stack tracing
cv=5,
verbose=1,
refit=True)
# Run the grid search
gs = gs.fit(X_train, y_train) #.values, y_train)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-bc3e83cc3c1e> in <module>()
91
92 # Run the grid search
---> 93 gs = gs.fit(X_train, y_train) #.values, y_train)
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
802
803 """
--> 804 return self._fit(X, y, ParameterGrid(self.param_grid))
805
806
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
551 self.fit_params, return_parameters=True,
552 error_score=self.error_score)
--> 553 for parameters in parameter_iterable
554 for train, test in cv)
555
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
798 # was dispatched. In particular this covers the edge
799 # case of Parallel used with an exhausted iterator.
--> 800 while self.dispatch_one_batch(iterator):
801 self._iterating = True
802 else:
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
656 return False
657 else:
--> 658 self._dispatch(tasks)
659 return True
660
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
564
565 if self._pool is None:
--> 566 job = ImmediateComputeBatch(batch)
567 self._jobs.append(job)
568 self.n_dispatched_batches += 1
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
178 # Don't delay the application, to avoid keeping the input
179 # arguments in memory
--> 180 self.results = batch()
181
182 def get(self):
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1529 estimator.fit(X_train, **fit_params)
1530 else:
-> 1531 estimator.fit(X_train, y_train, **fit_params)
1532
1533 except Exception as e:
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
162 the pipeline.
163 """
--> 164 Xt, fit_params = self._pre_transform(X, y, **fit_params)
165 self.steps[-1][-1].fit(Xt, y, **fit_params)
166 return self
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in _pre_transform(self, X, y, **fit_params)
143 for name, transform in self.steps[:-1]:
144 if hasattr(transform, "fit_transform"):
--> 145 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
146 else:
147 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit_transform(self, X, y)
239
240 def fit_transform(self, X, y):
--> 241 self.fit(X, y)
242 return self.transform(X)
243
/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y)
136 self._inclusion(orig_set=orig_set,
137 subset=prev_subset,
--> 138 X=X, y=y)
139 else:
140 k_idx, k_score, cv_scores = \
/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _inclusion(self, orig_set, subset, X, y)
205 for feature in remaining:
206 new_subset = tuple(subset | {feature})
--> 207 cv_scores = self._calc_score(X, y, new_subset)
208 all_avg_scores.append(cv_scores.mean())
209 all_cv_scores.append(cv_scores)
/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _calc_score(self, X, y, indices)
190 scoring=self.scorer,
191 n_jobs=self.n_jobs,
--> 192 pre_dispatch=self.pre_dispatch)
193 else:
194 self.est_.fit(X[:, indices], y)
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
1431 train, test, verbose, None,
1432 fit_params)
-> 1433 for train, test in cv)
1434 return np.array(scores)[:, 0]
1435
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
798 # was dispatched. In particular this covers the edge
799 # case of Parallel used with an exhausted iterator.
--> 800 while self.dispatch_one_batch(iterator):
801 self._iterating = True
802 else:
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
656 return False
657 else:
--> 658 self._dispatch(tasks)
659 return True
660
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
564
565 if self._pool is None:
--> 566 job = ImmediateComputeBatch(batch)
567 self._jobs.append(job)
568 self.n_dispatched_batches += 1
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
178 # Don't delay the application, to avoid keeping the input
179 # arguments in memory
--> 180 self.results = batch()
181
182 def get(self):
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1548
1549 else:
-> 1550 test_score = _score(estimator, X_test, y_test, scorer)
1551 if return_train_score:
1552 train_score = _score(estimator, X_train, y_train, scorer)
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _score(estimator, X_test, y_test, scorer)
1604 score = scorer(estimator, X_test)
1605 else:
-> 1606 score = scorer(estimator, X_test, y_test)
1607 if not isinstance(score, numbers.Number):
1608 raise ValueError("scoring must return a number, got %s (%s) instead."
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/metrics/scorer.py in __call__(self, estimator, X, y_true, sample_weight)
81 Score function applied to prediction of estimator on X.
82 """
---> 83 y_pred = estimator.predict(X)
84 if sample_weight is not None:
85 return self._sign * self._score_func(y_true, y_pred,
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/classification.py in predict(self, X)
145 X = check_array(X, accept_sparse='csr')
146
--> 147 neigh_dist, neigh_ind = self.kneighbors(X)
148
149 classes_ = self.classes_
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
373 dist = pairwise_distances(
374 X, self._fit_X, self.effective_metric_, n_jobs=n_jobs,
--> 375 **self.effective_metric_params_)
376
377 neigh_ind = argpartition(dist, n_neighbors - 1, axis=1)
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1205 func = partial(distance.cdist, metric=metric, **kwds)
1206
-> 1207 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1208
1209
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1052 if n_jobs == 1:
1053 # Special case to avoid picklability checks in delayed
-> 1054 return func(X, Y, **kwds)
1055
1056 # TODO: in some cases, backend='threading' may be appropriate
/home/hmf/my_py3/lib/python3.4/site-packages/scipy/spatial/distance.py in cdist(XA, XB, metric, p, V, VI, w)
2151 'one-dimensional.')
2152 if V.shape[0] != n:
-> 2153 raise ValueError('Variance vector V must be of the same '
2154 'dimension as the vectors on which the '
2155 'distances are computed.')
ValueError: Variance vector V must be of the same dimension as the vectors on which the distances are computed.
Something is definitely fishy here. Upon further investigation I found that the calculation of the variance using the var function is not the same for pandas' DataFrame and numpy's array.
Ah, sorry about that, I forgot that pandas does this differently by default :P To mimic its behavior you also need to set ddof to 1:
np.var(X, axis=0, ddof=1)
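To make the ddof difference concrete, here is a tiny self-contained check (the toy array is just for illustration):

import numpy as np
import pandas as pd

a = np.array([[1., 2.],
              [3., 4.],
              [5., 12.]])

print(np.var(a, axis=0))          # ddof=0, NumPy's default: [ 2.667  18.667]
print(np.var(a, axis=0, ddof=1))  # ddof=1 (sample variance): [ 4.  28.]
print(pd.DataFrame(a).var())      # pandas defaults to ddof=1, matching the line above

So X_train.var() on the DataFrame and np.var(X_train.values, axis=0, ddof=1) should then give identical numbers.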
Btw. your code with the SFS works fine for me after uncommenting the lines you mentioned and making a little change to the k_features in sfs1 (see the complete code at the end). I think your problem was that you initialized
sfs1 = SFS(estimator=knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
print_progress=False,
cv=5)
# !?!? n_jobs=-1)
and then ran
pipe = Pipeline([
('standardize', preprocessing.MinMaxScaler()),
('sfs', sfs1),
('knn', knn)])
# See KNeighborsClassifier equivalent param_grid
param_grid = [
seuclidean
#sfs_seuclidean
]
# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
#n_jobs=-1, for better stack tracing
cv=5,
verbose=1,
refit=True)
if I understood correctly? The problem here is that you then have only 3 features selected but 4 variance columns, which causes the error. So, I suggest setting k_features=4 so that the seuclidean param_grid works fine, and you could then do the feature selection in the sfs_seuclidean param_grid like below:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
# get the unnormalized data
#X = dy[ dy.columns.difference(['label']).values ]
#y = dy['label'].values
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
#V = X_train.var()#.values
V = X_train.var(axis=0)
# k_range : must be less than the training size. What happens if number of features > sample size
k_range = range(1, X.shape[1])
weights = ['uniform' , 'distance']
#algos_all = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all = ['ball_tree', 'kd_tree', 'brute']
algos = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]
# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:
sfs_seuclidean = {
'sfs__k_features' : list(range(1, X.shape[1])),
'sfs__estimator__metric' : ['seuclidean'],
'sfs__estimator__metric_params': [ {'V':V} ], # will be automatically calculated
#'sfs__estimator__algorithm' : ['ball_tree'] #, # TODO , ['brute', 'ball_tree'],
'sfs__estimator__algorithm' : ['brute'], #, # TODO , ['brute', 'ball_tree'],
'sfs__estimator__n_neighbors' : list(k_range),
'sfs__estimator__weights' : weights,
'sfs__estimator__leaf_size' : list(leaf_sizes)
}
seuclidean = {
'knn__metric' : ['seuclidean'],
'knn__metric_params': [ {'V':V} ], # will be automatically calculated
#'knn__algorithm' : ['ball_tree'] #, # TODO , ['brute', 'ball_tree'],
'knn__algorithm' : ['brute'], #, # TODO , ['brute', 'ball_tree'],
'knn__n_neighbors' : list(k_range),
'knn__weights' : weights,
'knn__leaf_size' : list(leaf_sizes)
}
# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())
sfs1 = SFS(estimator=knn,
k_features=4,
forward=True,
floating=False,
scoring='accuracy',
print_progress=False,
cv=5)
# !?!? n_jobs=-1)
pipe = Pipeline([
('standardize', preprocessing.MinMaxScaler()),
('sfs', sfs1),
('knn', knn)])
# See KNeighborsClassifier equivalent param_grid
param_grid = [
seuclidean,
sfs_seuclidean
]
# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
#n_jobs=-1, for better stack tracing
cv=5,
verbose=1,
refit=True)
# Run the grid search
gs = gs.fit(X_train, y_train) #.values, y_train)
Unfortunately that did not work for me. I copied and pasted your code exactly as you have it above and I get:
Fitting 5 folds for each of 144 candidates, totalling 720 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 2.3s
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-799ff450e1fc> in <module>()
91
92 # Run the grid search
---> 93 gs = gs.fit(X_train, y_train) #.values, y_train)
/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
802
803 """
--> 804 return self._fit(X, y, ParameterGrid(self.param_grid))
805
806
So I have a few questions here:

1. Isn't the k_features parameter of SequentialFeatureSelector used to indicate the maximum number of final features allowed? In the documentation it states "Number of features to select, where k_features < the full feature set."
2. I had assumed that I need only have one metric parameter, but in your code above you use both seuclidean and sfs_seuclidean. Why is this? Note that I am using the final knn to do the refit so as to get the best model for later use (something akin to one of your last examples in the SFS page).
3. You say that I need to set the k_features equal to the length of the variance vector. But this does not seem to make sense. As the SFS searches for the subset of variables used in the model, shouldn't it also select the appropriate (corresponding) elements of the variance vector V?

Apologies for insisting, but I simply cannot get this to execute.
TIA, HF
Hm, sorry for the trouble ... I just copied & pasted my code from above (the lower block) into a fresh Jupyter notebook and it works fine without any issues. I can't remember whether there was an SFS upgrade between your version and the latest dev version, but maybe it would be worthwhile upgrading just in case?
Isn't the k_features parameter of SequentialFeatureSelector used to indicate the maximum number of final features allowed? In the documentation it states "Number of features to select, where k_features < the full feature set."
Yes, that's right. Actually, it's the exact number of features. If k_features=2, it will return exactly 2 features in the pipeline (not "up to 2 features", just to clarify :) )
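For instance, a minimal sketch on Iris (assuming the fitted attribute k_feature_idx_, which holds the selected column indices):

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
sfs_demo = SFS(estimator=KNeighborsClassifier(n_neighbors=3),
               k_features=2,
               forward=True,
               floating=False,
               scoring='accuracy',
               print_progress=False,
               cv=5)
sfs_demo = sfs_demo.fit(iris.data, iris.target)
print(sfs_demo.k_feature_idx_)  # a tuple with exactly 2 indices, not "up to 2"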
I had assumed that I need only have one metric parameter, but in your code above you use both seuclidean and sfs_seuclidean. Why is this? Note that I am using the final knn to do the refit so as to get the best model for later use (something akin to one of your last examples in the SFS page).
Yeah, I was just lazy and simply uncommented the latter (but the former, seuclidean, is probably redundant then). You can also run it as
param_grid = [
#seuclidean,
sfs_seuclidean
]
(Just checked, it works fine as well)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 2.5s
[Parallel(n_jobs=1)]: Done 199 tasks | elapsed: 10.8s
[Parallel(n_jobs=1)]: Done 449 tasks | elapsed: 23.4s
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed: 28.8s finished
gs.best_params_
{'sfs__estimator__algorithm': 'brute',
'sfs__estimator__leaf_size': 5,
'sfs__estimator__metric': 'seuclidean',
'sfs__estimator__metric_params': {'V': array([ 0.685564, 0.167216, 3.070371, 0.601184])},
'sfs__estimator__n_neighbors': 1,
'sfs__estimator__weights': 'uniform',
'sfs__k_features': 3}
You say that I need to set the k_features equal to the length of the variance vector. But this does not seem to make sense. As the SFS searches for the subset of variables used in the model, shouldn't it also select the appropriate (corresponding) elements of the variance vector V?
Hm, sorry, yeah, I got a bit confused about that part ... sorry, I looked at the code only very briefly. I think the error had something to do with the fact that in
seuclidean = {
'knn__metric' : ['seuclidean'],
'knn__metric_params': [ {'V':V} ], # will be automatically calculated
#'knn__algorithm' : ['ball_tree'] #, # TODO , ['brute', 'ball_tree'],
'knn__algorithm' : ['brute'], #, # TODO , ['brute', 'ball_tree'],
'knn__n_neighbors' : list(k_range),
'knn__weights' : weights,
'knn__leaf_size' : list(leaf_sizes)
}
the k_features length was fixed to 3 (because this grid didn't have the sfs parameters to select), and the variance was a 4-dimensional vector.
So, maybe one more thing worth noting is that there are 2 different knns here: the one inside the SFS and the one outside the SFS. Or in other words, one KNN estimator used for feature selection, and one used for the final classification. For example, if I modify the parameter grid as follows (see the last line):
sfs_seuclidean = {
'sfs__k_features' : list(range(1, X.shape[1])),
'sfs__estimator__metric' : ['seuclidean'],
'sfs__estimator__metric_params': [ {'V':V} ], # will be automatically calculated
#'sfs__estimator__algorithm' : ['ball_tree'] #, # TODO , ['brute', 'ball_tree'],
'sfs__estimator__algorithm' : ['brute'], #, # TODO , ['brute', 'ball_tree'],
'sfs__estimator__n_neighbors' : list(k_range),
'sfs__estimator__weights' : weights,
'sfs__estimator__leaf_size' : list(leaf_sizes),
'knn__n_neighbors' : [1, 2, 3, 4],
}
you'll get the following:
gs = gs.fit(X_train, y_train)
Fitting 5 folds for each of 432 candidates, totalling 2160 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 3.1s
[Parallel(n_jobs=1)]: Done 199 tasks | elapsed: 12.6s
[Parallel(n_jobs=1)]: Done 449 tasks | elapsed: 26.9s
[Parallel(n_jobs=1)]: Done 799 tasks | elapsed: 45.3s
[Parallel(n_jobs=1)]: Done 1249 tasks | elapsed: 1.1min
[Parallel(n_jobs=1)]: Done 1799 tasks | elapsed: 1.7min
[Parallel(n_jobs=1)]: Done 2160 out of 2160 | elapsed: 2.0min finished
gs.best_params_
{'knn__n_neighbors': 4,
'sfs__estimator__algorithm': 'brute',
'sfs__estimator__leaf_size': 5,
'sfs__estimator__metric': 'seuclidean',
'sfs__estimator__metric_params': {'V': array([ 0.685564, 0.167216, 3.070371, 0.601184])},
'sfs__estimator__n_neighbors': 1,
'sfs__estimator__weights': 'uniform',
'sfs__k_features': 3}
EDIT: I just saw that I indeed made an update to the SFS in May, and in this case it's a crucial one! (https://github.com/rasbt/mlxtend/commit/84a90ab9929b311127968fef4467c7b5198780f7)
Now, the SFS clones the estimator by default, which I think is the better, recommended default behavior. In your older version, the same knn object instance was used both inside and outside the SFS, which is probably responsible for the bug that you are having. I bet that it will work fine if you update the mlxtend version. Sorry for the trouble here!
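In case "clones the estimator" is unclear: scikit-learn's clone gives a fresh, unfitted copy with the same settings, so the copy can be reconfigured or fitted without touching the original instance. A small illustration:

from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=10)
knn_copy = clone(knn)               # unfitted copy with identical parameters
knn_copy.set_params(n_neighbors=1)  # reconfigure the copy only

print(knn.n_neighbors)       # 10 -- the original is unaffected
print(knn_copy.n_neighbors)  # 1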
Hm, sorry for the trouble ...
Please don't apologise - I appreciate you taking the time to help.
I just saw that I indeed made an update to the SFS in May, and in this case it's a crucial one!
I have installed the dev version and it is working! Thank you.
I would just like to clear up some doubts I still have. Sorry, but I am a little slow on the uptake.
Isn't the k_features parameter of SequentialFeatureSelector used to indicate the maximum number of final features allowed? In the documentation it states "Number of features to select, where k_features < the full feature set."
Yes, that's right. Actually, it's the exact number of features. If k_features=2, it will return exactly 2 features in the pipeline (not "up to 2 features", just to clarify :) )
I used the code above and kept k_features = 4 in the SFS parameters. The result I get is:
{'sfs__estimator__algorithm': 'brute',
'sfs__estimator__leaf_size': 5,
'sfs__estimator__metric': 'seuclidean',
'sfs__estimator__metric_params': {'V': array([ 0.685564, 0.167216, 3.070371, 0.601184])},
'sfs__estimator__n_neighbors': 1,
'sfs__estimator__weights': 'uniform',
'sfs__k_features': 3}
Notice how 'sfs__k_features': 3. In fact, I can set that SFS k_features parameter to any value (say 6), because what counts is the 'sfs__k_features' : list(range(1, X.shape[1])), as these are the values of k_features that will be used by the SFS in the grid search. The SFS selects the best subset of features of a given size, and GridSearchCV selects the best subset size (in this case 3 features). Is this correct?
So, maybe one more thing worth noting is that there are 2 different knns here. The one inside the SFS and the one outside the SFS. Or in other words, one KNN estimator used for feature selection, and one used for the final classification. For example, if I modify the parameter grid as follows (see the last line):
sfs_seuclidean = {
    'sfs__k_features' : list(range(1, X.shape[1])),
    'sfs__estimator__metric' : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ], # will be automatically calculated
    #'sfs__estimator__algorithm' : ['ball_tree'] #, # TODO , ['brute', 'ball_tree'],
    'sfs__estimator__algorithm' : ['brute'], #, # TODO , ['brute', 'ball_tree'],
    'sfs__estimator__n_neighbors' : list(k_range),
    'sfs__estimator__weights' : weights,
    'sfs__estimator__leaf_size' : list(leaf_sizes),
    'knn__n_neighbors' : [1, 2, 3, 4],
}
So are you saying here that for each of the SFS's 540 fits, the last knn will be executed for 4 different neighbour sizes (4 x 540 = 2160)? Doesn't the last knn simply use the best neighbour size from the 'sfs__estimator__n_neighbors' : list(k_range) list? Isn't it simply repeating the same calculations?
Once again thanks for the help.
Notice how 'sfs__k_features': 3. In fact, I can set that SFS k_features parameter to any value (say 6), because what counts is the 'sfs__k_features' : list(range(1, X.shape[1])), as these are the values of k_features that will be used by the SFS in the grid search. The SFS selects the best subset of features of a given size, and GridSearchCV selects the best subset size (in this case 3 features). Is this correct?
Yes! But prior to that, i.e., in your old version, you encountered this V-related dimensionality issue because you pre-computed the variance based on 4 features, but then fed 3 features to the algorithm. Now it doesn't matter anymore how you seed the k_features, since you modify them during the grid search anyway; the k_features = 4 becomes just a "placeholder".
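(Continuing from the grid search above, you could then double-check which columns the refit pipeline actually kept, e.g. via the fitted SFS step - again assuming the k_feature_idx_ attribute:)

best_sfs = gs.best_estimator_.named_steps['sfs']
print(best_sfs.k_feature_idx_)             # indices of the selected feature subset
print(gs.best_params_['sfs__k_features'])  # should match the length of that tuple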
So are you saying here that for each of the SFS's 540 fits, the last knn will be executed for 4 different neighbour sizes (4 x 540 = 2160)? Doesn't the last knn simply use the best neighbour size from the 'sfs__estimator__n_neighbors' : list(k_range) list? Isn't it simply repeating the same calculations?
Yeah, that sounds correct. So basically, it works like this now:

1. The SFS selects a feature subset using its own (cloned) KNN estimator, whose n_neighbors is set via sfs__estimator__n_neighbors.
2. The SFS passes the subset of k_features on to the separate KNN classifier, whose n_neighbors is set via knn__n_neighbors, for the final classification.
Or maybe let's say we have a LogisticRegression classifier in the SFS to make things more clear:
sfs1 = SFS(estimator=LogisticRegression(),
k_features=4,
forward=True,
floating=False,
scoring='accuracy',
print_progress=False,
cv=5)
# !?!? n_jobs=-1)
pipe = Pipeline([
('standardize', preprocessing.MinMaxScaler()),
('sfs', sfs1),
('knn', knn)])
Here, the SFS selects the k_features using the LogisticRegression estimator and then passes them on to the KNN classifier with its own n_neighbors for the final classification.
Now, you can use the same classifier instance for both selection and classification, e.g.,
lr = LogisticRegression()
sfs1 = SFS(estimator=lr,
clone=False,
k_features=4,
forward=True,
floating=False,
scoring='accuracy',
print_progress=False,
cv=5)
# !?!? n_jobs=-1)
pipe = Pipeline([
('standardize', preprocessing.MinMaxScaler()),
('sfs', sfs1),
('clf', lr)])
However, in your case, we then have the problem that the V dimensions from the SFS don't match the classifier's in the pipeline, and vice versa.
The explanation above is clear, thank you. However, from your last sentence I conclude that I cannot trust the results I am getting, because the V variance is not being used correctly. I did a little digging and found the source:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py
which led me to:
https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py
It seems that if I don't provide a V, one will be calculated automatically:
elif mstr in ['seuclidean', 'se', 's']:
    X = _convert_to_double(X)
    if V is not None:
        V = np.asarray(V, order='c')
        if V.dtype != np.double:
            raise TypeError('Variance vector V must contain doubles.')
        if len(V.shape) != 1:
            raise ValueError('Variance vector V must '
                             'be one-dimensional.')
        if V.shape[0] != n:
            raise ValueError('Variance vector V must be of the same '
                             'dimension as the vectors on which the distances '
                             'are computed.')
        # The C code doesn't do striding.
        VV = _copy_array_if_base_present(_convert_to_double(V))
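As a quick sanity check of my reading of that source - when V is omitted, cdist appears to compute it from the stacked inputs with ddof=1 (random data just for illustration):

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(1)
XA, XB = rng.rand(5, 4), rng.rand(3, 4)

# sample variance (ddof=1) of the stacked inputs
V = np.var(np.vstack([XA, XB]), axis=0, ddof=1)

d_auto = cdist(XA, XB, 'seuclidean')           # V computed automatically
d_explicit = cdist(XA, XB, 'seuclidean', V=V)  # same V passed explicitly

print(np.allclose(d_auto, d_explicit))  # expected: True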
So I redid the test with the brute and ball_tree algorithms (separately) without setting the V variable. This worked:
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 1.6s
[Parallel(n_jobs=1)]: Done 199 tasks | elapsed: 6.7s
[Parallel(n_jobs=1)]: Done 449 tasks | elapsed: 15.3s
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed: 18.4s finished
and produced the same results:
{'sfs__estimator__algorithm': 'ball_tree',
'sfs__estimator__leaf_size': 5,
'sfs__estimator__metric': 'seuclidean',
'sfs__estimator__n_neighbors': 1,
'sfs__estimator__weights': 'uniform',
'sfs__k_features': 3}
Strangely enough, I recall trying this before but it failed. I cannot remember what I did differently. Anyway, problem solved.
Thank you for your help.
EDIT: maybe my initial tests, where I did not provide a V, failed because I was using an earlier version of mlxtend.
Hello,
I posted this question in the Google groups but it did not seem to attract any attention, so I am posting it here. If this is not correct, please tell me.
I have taken some scikit-learn source code that used the standard grid search and adapted it to use a pipe with the SFS. I use the "seuclidean" metric with the ball_tree algorithm, which requires a metric parameter - a variance vector. When I execute the standard scikit-learn code I have no problem. However, with the SFS in a Pipeline I get two errors:
TypeError: __init__() takes exactly 1 positional argument (0 given)
ValueError: SEuclidean dist: size of V does not match
Error 2 is understandable - because the SFS does feature selection, I cannot pre-calculate this value; it depends on the features used. I was expecting the metric parameters to be automatically calculated and therefore not to require this input. I also tried to pass None as the parameter, but with no success. Can anyone shed light on how I should proceed? I have added my code below in case this helps (the data sets are managed with Pandas).
TIA, Hugo
Stack Trace 1
Fitting 5 folds for each of 1200 candidates, totalling 6000 fits
Stack Trace 2
Fitting 5 folds for each of 1200 candidates, totalling 6000 fits