rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Pipe using `SequentialFeatureSelector`: problems passing metric parameters #69

Closed hmf closed 8 years ago

hmf commented 8 years ago

Hello,

I posted this question in the Google Groups forum, but it did not attract any attention, so I am posting it here. If this is not the right place, please tell me.

I have taken some scikit-learn source code that used the standard grid search and adapted it to use a pipeline that includes the SFS. I use the "seuclidean" metric with the ball-tree algorithm, which requires a metric parameter: a variance vector. When I execute the standard scikit-learn code I have no problem. However, with the SFS in a Pipeline I get two errors:

  1. If I do not provide the metric's parameters I get (see stack trace 1): TypeError: __init__() takes exactly 1 positional argument (0 given)
  2. If I provide the parameter I get (see stack trace 2): ValueError: SEuclidean dist: size of V does not match

Error 2 is understandable: because SFS does feature selection, I cannot pre-calculate this value. It depends on the features used. I was expecting the metric parameters to be calculated automatically and therefore not to require this input. I also tried to pass None as the parameter, but with no success.
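To illustrate why a precomputed V cannot survive feature selection, here is a minimal sketch that uses SciPy's `cdist` directly rather than the pipeline from this post. The toy data and the `subset` column choice are assumptions made only for the illustration: the standardized Euclidean metric needs one variance per column, so V has to be restricted to the columns actually in use.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy data (an assumption for illustration): 20 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

# One variance per feature, so V_full has length 4.
V_full = X.var(axis=0, ddof=1)

# Columns a feature selector might try at some step (hypothetical choice).
subset = [0, 2]
X_sub = X[:, subset]

# Restricting V to the same columns works.
D = cdist(X_sub, X_sub, metric="seuclidean", V=V_full[subset])

# The full-length V no longer matches the 2-column input and is rejected.
try:
    cdist(X_sub, X_sub, metric="seuclidean", V=V_full)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(D.shape)
print(type(mismatch_error).__name__)
```

The same constraint applies to any estimator that receives a column subset together with a fixed-length V.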

Can anyone shed light on how I should proceed? I have added my code below in case this helps (data sets managed with Pandas).

TIA, Hugo


import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing

# get the unnormalized data
X = dy[ dy.columns.difference(['label']).values ]
y = dy['label'].values                           

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

V = X_train.var().values
C = X_train.cov().values
CPI = np.linalg.pinv(C)
CI = np.linalg.inv(C)

# k_range : must be less than the training size. What happens if number of features > sample size
k_range    = range(1, len(X.columns))
weights    = ['uniform' , 'distance']
#algos_all  = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all  = ['ball_tree', 'kd_tree', 'brute']
algos      = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)   
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]

# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:

seuclidean = {
    'sfs__k_features'              : list(range(1,len(X.columns))),
    'sfs__estimator__metric'       : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ],  # will be automatically calculated
    'sfs__estimator__algorithm'    : ['ball_tree'],  # TODO , ['brute', 'ball_tree'],
    'sfs__estimator__n_neighbors'  : list(k_range),
    'sfs__estimator__weights'      : weights,
    'sfs__estimator__leaf_size'    : list(leaf_sizes) }

from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend

# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())

sfs1 = SFS(estimator=knn,
           k_features=3,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
                 ('sfs', sfs1),
                 ('knn', knn)])

# See KNeighborsClassifier equivalent param_grid
param_grid = [
    seuclidean
  ]

# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  #n_jobs=-1,  for better stack tracing
                  cv=5,
                  verbose=1,
                  refit=True)

# Run the grid search
gs = gs.fit(X_train.values, y_train)

Stack Trace 1

Fitting 5 folds for each of 1200 candidates, totalling 6000 fits

TypeError                                 Traceback (most recent call last)
<ipython-input-68-4ef553dad211> in <module>()
    167
    168 # Run the grid search
--> 169 gs = gs.fit(X_train.values, y_train)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
    802
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805
    806

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
    551                                     self.fit_params, return_parameters=True,
    552                                     error_score=self.error_score)
--> 553                 for parameters in parameter_iterable
    554                 for train, test in cv)
    555

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    162             the pipeline.
    163         """
--> 164         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    165         self.steps[-1][-1].fit(Xt, y, **fit_params)
    166         return self

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in _pre_transform(self, X, y, **fit_params)
    143         for name, transform in self.steps[:-1]:
    144             if hasattr(transform, "fit_transform"):
--> 145                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    146             else:
    147                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit_transform(self, X, y)
    239
    240     def fit_transform(self, X, y):
--> 241         self.fit(X, y)
    242         return self.transform(X)
    243

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y)
    136                     self._inclusion(orig_set=orig_set,
    137                                     subset=prev_subset,
--> 138                                     X=X, y=y)
    139             else:
    140                 k_idx, k_score, cv_scores = \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _inclusion(self, orig_set, subset, X, y)
    205             for feature in remaining:
    206                 new_subset = tuple(subset | {feature})
--> 207                 cv_scores = self._calc_score(X, y, new_subset)
    208                 all_avg_scores.append(cv_scores.mean())
    209                 all_cv_scores.append(cv_scores)

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _calc_score(self, X, y, indices)
    190                                      scoring=self.scorer,
    191                                      n_jobs=self.n_jobs,
--> 192                                      pre_dispatch=self.pre_dispatch)
    193         else:
    194             self.est_.fit(X[:, indices], y)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1431                                               train, test, verbose, None,
   1432                                               fit_params)
-> 1433                       for train, test in cv)
   1434     return np.array(scores)[:, 0]
   1435

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in fit(self, X, y)
    801             self._y = self._y.ravel()
    802
--> 803         return self._fit(X)
    804
    805

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in _fit(self, X)
    256             self._tree = BallTree(X, self.leaf_size,
    257                                   metric=self.effective_metric_,
--> 258                                   **self.effective_metric_params_)
    259         elif self._fit_method == 'kd_tree':
    260             self._tree = KDTree(X, self.leaf_size,

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:8381)()

sklearn/neighbors/dist_metrics.pyx in sklearn.neighbors.dist_metrics.DistanceMetric.get_metric (sklearn/neighbors/dist_metrics.c:4330)()

sklearn/neighbors/dist_metrics.pyx in sklearn.neighbors.dist_metrics.SEuclideanDistance.__init__ (sklearn/neighbors/dist_metrics.c:5888)()

TypeError: __init__() takes exactly 1 positional argument (0 given)

Stack Trace 2

Fitting 5 folds for each of 1200 candidates, totalling 6000 fits

ValueError                                Traceback (most recent call last)
<ipython-input-69-558dd50887b6> in <module>()
    167
    168 # Run the grid search
--> 169 gs = gs.fit(X_train.values, y_train)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
    802
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805
    806

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
    551                                     self.fit_params, return_parameters=True,
    552                                     error_score=self.error_score)
--> 553                 for parameters in parameter_iterable
    554                 for train, test in cv)
    555

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    162             the pipeline.
    163         """
--> 164         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    165         self.steps[-1][-1].fit(Xt, y, **fit_params)
    166         return self

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in _pre_transform(self, X, y, **fit_params)
    143         for name, transform in self.steps[:-1]:
    144             if hasattr(transform, "fit_transform"):
--> 145                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    146             else:
    147                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit_transform(self, X, y)
    239
    240     def fit_transform(self, X, y):
--> 241         self.fit(X, y)
    242         return self.transform(X)
    243

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y)
    136                     self._inclusion(orig_set=orig_set,
    137                                     subset=prev_subset,
--> 138                                     X=X, y=y)
    139             else:
    140                 k_idx, k_score, cv_scores = \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _inclusion(self, orig_set, subset, X, y)
    205             for feature in remaining:
    206                 new_subset = tuple(subset | {feature})
--> 207                 cv_scores = self._calc_score(X, y, new_subset)
    208                 all_avg_scores.append(cv_scores.mean())
    209                 all_cv_scores.append(cv_scores)

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _calc_score(self, X, y, indices)
    190                                      scoring=self.scorer,
    191                                      n_jobs=self.n_jobs,
--> 192                                      pre_dispatch=self.pre_dispatch)
    193         else:
    194             self.est_.fit(X[:, indices], y)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1431                                               train, test, verbose, None,
   1432                                               fit_params)
-> 1433                       for train, test in cv)
   1434     return np.array(scores)[:, 0]
   1435

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in fit(self, X, y)
    801             self._y = self._y.ravel()
    802
--> 803         return self._fit(X)
    804
    805

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in _fit(self, X)
    256             self._tree = BallTree(X, self.leaf_size,
    257                                   metric=self.effective_metric_,
--> 258                                   **self.effective_metric_params_)
    259         elif self._fit_method == 'kd_tree':
    260             self._tree = KDTree(X, self.leaf_size,

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:8793)()

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree._recursive_build (sklearn/neighbors/ball_tree.c:10053)()

sklearn/neighbors/ball_tree.pyx in sklearn.neighbors.ball_tree.init_node (sklearn/neighbors/ball_tree.c:20030)()

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree.rdist (sklearn/neighbors/ball_tree.c:9932)()

sklearn/neighbors/dist_metrics.pyx in sklearn.neighbors.dist_metrics.SEuclideanDistance.rdist (sklearn/neighbors/dist_metrics.c:6065)()

ValueError: SEuclidean dist: size of V does not match

rasbt commented 8 years ago

Hi, @hmf . Sorry for the silence, just got back from SciPy & Texas; I will try to take a closer look at it on the weekend!

hmf commented 8 years ago

Thank you.

rasbt commented 8 years ago

Hi, Hugo, hm, I am not completely sure why you are getting this error, but I think it may be related to passing Pandas DataFrames somewhere. I.e., to run your example code, I loaded the iris dataset and had no issues with that. So, may I suggest to just use something like

X, y = X.values, y.values 

at the beginning of your code and adjust the following lines, e.g.,:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
from sklearn.datasets import load_iris

# get the unnormalized data
#X = dy[ dy.columns.difference(['label']).values ]
#y = dy['label'].values                           

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

V = X_train.var()#.values
C = np.cov(X_train)#.values
CPI = np.linalg.pinv(C)
CI = np.linalg.inv(C)

# k_range : must be less than the training size. What happens if number of features > sample size
k_range    = range(1, X.shape[1])
weights    = ['uniform' , 'distance']
#algos_all  = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all  = ['ball_tree', 'kd_tree', 'brute']
algos      = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)   
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]

# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:

seuclidean = {
    'sfs__k_features'              : list(range(1, X.shape[1])),
    'sfs__estimator__metric'       : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ],  # will be automatically calculated
    'sfs__estimator__algorithm'    : ['ball_tree'],  # TODO , ['brute', 'ball_tree'],
    'sfs__estimator__n_neighbors'  : list(k_range),
    'sfs__estimator__weights'      : weights,
    'sfs__estimator__leaf_size'    : list(leaf_sizes) }

from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend

# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())

sfs1 = SFS(estimator=knn,
           k_features=3,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
                 ('sfs', sfs1),
                 ('knn', knn)])

# See KNeighborsClassifier equivalent param_grid
param_grid = [
    seuclidean
  ]

# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  #n_jobs=-1,  for better stack tracing
                  cv=5,
                  verbose=1,
                  refit=True)

# Run the grid search
gs = gs.fit(X_train, y_train) #.values, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    2.5s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   10.4s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:   25.2s
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed:   30.3s finished

I am closing this issue since it doesn't seem to be a mlxtend code-related issue, but please feel free to comment further on it :)

Cheers, Sebastian

hmf commented 8 years ago

Appreciate you looking into it. I copied and pasted your code into a cell to make sure I did not mess up. Unfortunately now I get:

[I 08:40:56.355 NotebookApp] KernelRestarter: restarting kernel (1/5)
WARNING:root:kernel ca73f90f-7cd8-4c45-8d93-c146f8d30738 restarted
[I 08:41:09.265 NotebookApp] Kernel shutdown: ca73f90f-7cd8-4c45-8d93-c146f8d30738

In addition to this, I cannot reproduce my previous results. I update my virtualenv regularly, so it may be due to a change in some other package. I will have to look at this more carefully.

Apologies for the "noise".

Regards, HF

hmf commented 8 years ago

Hi Sebastian,

Something is definitely fishy here. Upon further investigation, I found that computing the variance with the var function does not give the same result for a pandas DataFrame and a NumPy array. More specifically, if you do not specify the axis in NumPy, the variance is computed over the flattened matrix. This should result in an exception, because the Iris dataset has 4 columns, so the variance V should be an array of length 4. Instead, I get a kernel failure when using the "ball_tree" algorithm, so I have done the tests here using the "brute" algorithm. To see the difference in the variance calculations, you can use the code below:

import pandas as pd 

print(X_train.var())
print()
print(pd.DataFrame(X_train).var())
print(X_train.var(axis=0))

And the output should be something like:

3.8504924375

0    0.692489
1    0.168905
2    3.101385
3    0.607257
dtype: float64
[ 0.685564 0.167216 3.070371 0.601184]
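The per-column values from pandas and NumPy differ slightly because pandas' `var` defaults to `ddof=1` (sample variance) while NumPy's `var` defaults to `ddof=0` (population variance). With 100 training rows the ratio is 100/99, which matches the numbers above. A small sketch on toy data (the random array is an assumption; only the defaults matter here):

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): 100 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

v_flat = X.var()                         # no axis: one scalar over all 400 values
v_np = X.var(axis=0)                     # per column, ddof=0 (NumPy default)
v_np_ddof1 = X.var(axis=0, ddof=1)       # per column, ddof=1
v_pd = pd.DataFrame(X).var().to_numpy()  # per column, ddof=1 (pandas default)

print(np.ndim(v_flat))                   # scalar, not a length-4 vector
print(np.allclose(v_np_ddof1, v_pd))     # same convention, same values
print(np.allclose(v_np * 100 / 99, v_pd))
```

Passing `axis=0` (and, if the pandas values are wanted, `ddof=1`) gives a vector with one entry per column, which is the shape the metric expects.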

So I reused the code you had. I now have 2 metric configurations: sfs_seuclidean and seuclidean. We use the pipe to send the data and parameters through the SFS with the first configuration, or let them go directly to the KNN via the second configuration. The example (see code below) has the pipe with the SFS commented out and the param_grid set to the second configuration. I get the following output:
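To make the two parameter routes concrete, the sketch below lists the `sfs__estimator__*` and `knn__*` names on a pipeline of the same shape. It substitutes scikit-learn's own `SequentialFeatureSelector` for the mlxtend class so that it runs with scikit-learn alone; that swap is an assumption of the sketch, not what this thread uses. It also shows that after `clone` (which `GridSearchCV` applies before fitting) the estimator inside the selector and the final `knn` step are separate objects, so a parameter set through `sfs__estimator__...` does not reach the final step.

```python
from sklearn.base import clone
from sklearn.feature_selection import SequentialFeatureSelector  # stand-in for mlxtend's SFS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

knn = KNeighborsClassifier(n_neighbors=10)
pipe = Pipeline([
    ("standardize", MinMaxScaler()),
    ("sfs", SequentialFeatureSelector(knn, n_features_to_select=2)),
    ("knn", knn),
])

# Parameter names that a param_grid can address on each route.
param_names = sorted(pipe.get_params())
sfs_keys = [p for p in param_names if p.startswith("sfs__estimator__")]
knn_keys = [p for p in param_names if p.startswith("knn__")]

# clone() copies nested estimators independently of each other.
cloned = clone(pipe)
cloned.set_params(sfs__estimator__n_neighbors=3)
inner = cloned.named_steps["sfs"].estimator
final = cloned.named_steps["knn"]

print("sfs__estimator__metric_params" in sfs_keys, "knn__metric_params" in knn_keys)
print(inner is final, inner.n_neighbors, final.n_neighbors)
```

Under these assumptions, a grid that only sets `sfs__estimator__metric` would leave the final classifier on its default metric unless the matching `knn__` keys are set as well.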

Fitting 5 folds for each of 36 candidates, totalling 180 fits

[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:    0.3s finished

If, however, I use the SFS configuration (uncomment the sfs step and use the alternate configuration), then I get an exception (see end of message for full stack trace):

/home/hmf/my_py3/lib/python3.4/site-packages/scipy/spatial/distance.py in cdist(XA, XB, metric, p, V, VI, w)
   2151                                      'one-dimensional.')
   2152                 if V.shape[0] != n:
-> 2153                     raise ValueError('Variance vector V must be of the same '
   2154                                      'dimension as the vectors on which the '
   2155                                      'distances are computed.')

ValueError: Variance vector V must be of the same dimension as the vectors on which the distances are computed.
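This exception is what one would expect if the SFS scores the estimator on column subsets while V keeps its full length. One possible workaround, sketched below, is a thin wrapper that recomputes V inside `fit` from whatever columns it receives. `SEuclideanKNN` is a hypothetical helper written for this illustration; it is not part of scikit-learn or mlxtend, and it has not been tested together with mlxtend's SFS.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


class SEuclideanKNN(ClassifierMixin, BaseEstimator):
    """Hypothetical k-NN wrapper that derives V from the data passed to fit()."""

    def __init__(self, n_neighbors=5, weights="uniform",
                 algorithm="brute", leaf_size=30):
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.algorithm = algorithm
        self.leaf_size = leaf_size

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # One variance per column of the X seen in this call.
        # Assumes no constant column (a zero variance would divide by zero).
        self.V_ = X.var(axis=0, ddof=1)
        self.knn_ = KNeighborsClassifier(
            n_neighbors=self.n_neighbors,
            weights=self.weights,
            algorithm=self.algorithm,
            leaf_size=self.leaf_size,
            metric="seuclidean",
            metric_params={"V": self.V_},
        )
        self.knn_.fit(X, y)
        self.classes_ = self.knn_.classes_
        return self

    def predict(self, X):
        return self.knn_.predict(np.asarray(X, dtype=float))


X, y = load_iris(return_X_y=True)
full = SEuclideanKNN(n_neighbors=5).fit(X, y)             # V_ has length 4
sub = SEuclideanKNN(n_neighbors=5).fit(X[:, [0, 2]], y)   # V_ has length 2
pred = sub.predict(X[:, [0, 2]])
print(full.V_.shape, sub.V_.shape, pred.shape)
```

If this approach is used, the grid would no longer need a `metric_params` entry; keys such as `sfs__estimator__n_neighbors` would address the wrapper's own parameters.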

I am using version 0.4.1 of mlxtend via a virtualenv installation. Could you use the code below to check if the diagnosis above is correct?

TIA, Hugo

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
from sklearn.datasets import load_iris

from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend

# get the unnormalized data
#X = dy[ dy.columns.difference(['label']).values ]
#y = dy['label'].values                           

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

#V = X_train.var()#.values
V = X_train.var(axis=0)

# k_range : must be less than the training size. What happens if number of features > sample size
k_range    = range(1, X.shape[1])
weights    = ['uniform' , 'distance']
#algos_all  = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all  = ['ball_tree', 'kd_tree', 'brute']
algos      = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)   
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]

# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:

sfs_seuclidean = {
    'sfs__k_features'              : list(range(1, X.shape[1])),
    'sfs__estimator__metric'       : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ],  # will be automatically calculated
    #'sfs__estimator__algorithm'    : ['ball_tree'] #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__algorithm'    : ['brute'], #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__n_neighbors'  : list(k_range),
     'sfs__estimator__weights'      : weights,
     'sfs__estimator__leaf_size'    : list(leaf_sizes) 
}

seuclidean = {
    'knn__metric'       : ['seuclidean'],
    'knn__metric_params': [ {'V':V} ],  # will be automatically calculated
    #'knn__algorithm'    : ['ball_tree'] #,  # TODO , ['brute', 'ball_tree'],
     'knn__algorithm'    : ['brute'], #,  # TODO , ['brute', 'ball_tree'],
     'knn__n_neighbors'  : list(k_range),
     'knn__weights'      : weights,
     'knn__leaf_size'    : list(leaf_sizes) 
}

# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())

sfs1 = SFS(estimator=knn,
           k_features=3,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
#                 ('sfs', sfs1),
                 ('knn', knn)])

# See KNeighborsClassifier equivalent param_grid
param_grid = [
    seuclidean
    #sfs_seuclidean
  ]

# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  #n_jobs=-1,  for better stack tracing
                  cv=5,
                  verbose=1,
                  refit=True)

# Run the grid search
gs = gs.fit(X_train, y_train) #.values, y_train)

Stack Trace

Fitting 5 folds for each of 108 candidates, totalling 540 fits

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-bc3e83cc3c1e> in <module>()
     91 
     92 # Run the grid search
---> 93 gs = gs.fit(X_train, y_train) #.values, y_train)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
    802 
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805 
    806 

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
    551                                     self.fit_params, return_parameters=True,
    552                                     error_score=self.error_score)
--> 553                 for parameters in parameter_iterable
    554                 for train, test in cv)
    555 

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660 

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564 
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181 
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532 
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    162             the pipeline.
    163         """
--> 164         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    165         self.steps[-1][-1].fit(Xt, y, **fit_params)
    166         return self

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in _pre_transform(self, X, y, **fit_params)
    143         for name, transform in self.steps[:-1]:
    144             if hasattr(transform, "fit_transform"):
--> 145                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    146             else:
    147                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit_transform(self, X, y)
    239 
    240     def fit_transform(self, X, y):
--> 241         self.fit(X, y)
    242         return self.transform(X)
    243 

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y)
    136                     self._inclusion(orig_set=orig_set,
    137                                     subset=prev_subset,
--> 138                                     X=X, y=y)
    139             else:
    140                 k_idx, k_score, cv_scores = \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _inclusion(self, orig_set, subset, X, y)
    205             for feature in remaining:
    206                 new_subset = tuple(subset | {feature})
--> 207                 cv_scores = self._calc_score(X, y, new_subset)
    208                 all_avg_scores.append(cv_scores.mean())
    209                 all_cv_scores.append(cv_scores)

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _calc_score(self, X, y, indices)
    190                                      scoring=self.scorer,
    191                                      n_jobs=self.n_jobs,
--> 192                                      pre_dispatch=self.pre_dispatch)
    193         else:
    194             self.est_.fit(X[:, indices], y)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1431                                               train, test, verbose, None,
   1432                                               fit_params)
-> 1433                       for train, test in cv)
   1434     return np.array(scores)[:, 0]
   1435 

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660 

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564 
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181 
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1548 
   1549     else:
-> 1550         test_score = _score(estimator, X_test, y_test, scorer)
   1551         if return_train_score:
   1552             train_score = _score(estimator, X_train, y_train, scorer)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _score(estimator, X_test, y_test, scorer)
   1604         score = scorer(estimator, X_test)
   1605     else:
-> 1606         score = scorer(estimator, X_test, y_test)
   1607     if not isinstance(score, numbers.Number):
   1608         raise ValueError("scoring must return a number, got %s (%s) instead."

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/metrics/scorer.py in __call__(self, estimator, X, y_true, sample_weight)
     81             Score function applied to prediction of estimator on X.
     82         """
---> 83         y_pred = estimator.predict(X)
     84         if sample_weight is not None:
     85             return self._sign * self._score_func(y_true, y_pred,

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/classification.py in predict(self, X)
    145         X = check_array(X, accept_sparse='csr')
    146 
--> 147         neigh_dist, neigh_ind = self.kneighbors(X)
    148 
    149         classes_ = self.classes_

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
    373                 dist = pairwise_distances(
    374                     X, self._fit_X, self.effective_metric_, n_jobs=n_jobs,
--> 375                     **self.effective_metric_params_)
    376 
    377             neigh_ind = argpartition(dist, n_neighbors - 1, axis=1)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1205         func = partial(distance.cdist, metric=metric, **kwds)
   1206 
-> 1207     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1208 
   1209 

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1052     if n_jobs == 1:
   1053         # Special case to avoid picklability checks in delayed
-> 1054         return func(X, Y, **kwds)
   1055 
   1056     # TODO: in some cases, backend='threading' may be appropriate

/home/hmf/my_py3/lib/python3.4/site-packages/scipy/spatial/distance.py in cdist(XA, XB, metric, p, V, VI, w)
   2151                                      'one-dimensional.')
   2152                 if V.shape[0] != n:
-> 2153                     raise ValueError('Variance vector V must be of the same '
   2154                                      'dimension as the vectors on which the '
   2155                                      'distances are computed.')

ValueError: Variance vector V must be of the same dimension as the vectors on which the distances are computed.
rasbt commented 8 years ago

Something is definitely fishy here. So upon further investigation I found that the calculation of the variance using the var function is not the same for a pandas DataFrame and a NumPy array.

Ah, sorry about that, I forgot that pandas does this differently by default :P To mimic its behavior you also need to set ddof to 1:

np.var(X, axis=0, ddof=1)
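To make the difference concrete, here is a small sketch that uses only Python's standard library, with made-up sample values. `statistics.pvariance` divides by n, which corresponds to NumPy's default (`ddof=0`), while `statistics.variance` divides by n - 1, which corresponds to the pandas default (`ddof=1`).

```python
import statistics

# Made-up sample values, for illustration only
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population variance: divide by n (NumPy's default, ddof=0)
population_var = statistics.pvariance(values)

# Sample variance: divide by n - 1 (pandas' default, ddof=1)
sample_var = statistics.variance(values)

# The two estimates differ exactly by the factor n / (n - 1)
n = len(values)
rescaled = population_var * n / (n - 1)
```

For small training sets the factor n / (n - 1) is noticeable, so a V computed one way will not match a V computed the other way.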

Btw., your code with SFS works fine for me after uncommenting the lines you mentioned and making a small change to k_features in sfs1 (see the complete code at the end). I think your problem was that you initialized

sfs1 = SFS(estimator=knn,
           k_features=3,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

and then ran

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
                 ('sfs', sfs1),
                 ('knn', knn)])

# See KNeighborsClassifier equivalent param_grid
param_grid = [
    seuclidean
    #sfs_seuclidean
  ]

# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  #n_jobs=-1,  for better stack tracing
                  cv=5,
                  verbose=1,
                  refit=True)

if I understood correctly? The problem here is that you then have only 3 features selected but 4 variance columns, which causes the error. So, I suggest setting k_features=4 so that the seuclidean param_grid works fine, and you could then do the feature selection in the sfs_seuclidean param_grid like below:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
from sklearn.datasets import load_iris

from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend

# get the unormalized data
#X = dy[ dy.columns.difference(['label']).values ]
#y = dy['label'].values                           

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

#V = X_train.var()#.values
V = X_train.var(axis=0)

# k_range : must be less than the training size. What happens if number of features > sample size
k_range    = range(1, X.shape[1])
weights    = ['uniform' , 'distance']
#algos_all  = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all  = ['ball_tree', 'kd_tree', 'brute']
algos      = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)   
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]

# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:

sfs_seuclidean = {
    'sfs__k_features'              : list(range(1, X.shape[1])),
    'sfs__estimator__metric'       : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ],  # will be automatically calculated
    #'sfs__estimator__algorithm'    : ['ball_tree'] #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__algorithm'    : ['brute'], #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__n_neighbors'  : list(k_range),
     'sfs__estimator__weights'      : weights,
     'sfs__estimator__leaf_size'    : list(leaf_sizes) 
}

seuclidean = {
    'knn__metric'       : ['seuclidean'],
    'knn__metric_params': [ {'V':V} ],  # will be automatically calculated
    #'knn__algorithm'    : ['ball_tree'] #,  # TODO , ['brute', 'ball_tree'],
     'knn__algorithm'    : ['brute'], #,  # TODO , ['brute', 'ball_tree'],
     'knn__n_neighbors'  : list(k_range),
     'knn__weights'      : weights,
     'knn__leaf_size'    : list(leaf_sizes) 
}

# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())

sfs1 = SFS(estimator=knn,
           k_features=4,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
                 ('sfs', sfs1),
                 ('knn', knn)])

# See KNeighborsClassifier equivalent param_grid
param_grid = [
    seuclidean,
    sfs_seuclidean
  ]

# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  #n_jobs=-1,  for better stack tracing
                  cv=5,
                  verbose=1,
                  refit=True)

# Run the grid search
gs = gs.fit(X_train, y_train) #.values, y_train)
hmf commented 8 years ago

Unfortunately that did not work for me. I copied and pasted your code as you have it above and I get:

Fitting 5 folds for each of 144 candidates, totalling 720 fits

[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    2.3s

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-799ff450e1fc> in <module>()
     91 
     92 # Run the grid search
---> 93 gs = gs.fit(X_train, y_train) #.values, y_train)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
    802 
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805 
    806 

So I have a few questions here:

  1. Isn't the k_features parameter of SequentialFeatureSelector used to indicate the maximum number of final features allowed? The documentation states "Number of features to select, where k_features < the full feature set."
  2. I had assumed that I need only have one metric parameter, but in your code above you use both seuclidean and sfs_seuclidean. Why is this? Note that I am using the final knn to do the refit so as to get the best model for later use (something akin to one of your last examples in the SFS page).
  3. You say that I need to set k_features equal to the length of the variance vector. But this does not seem to make sense. As the SFS searches for the subset of variables used in the model, shouldn't it also select the appropriate (corresponding) elements of the variance vector V?

Apologies for insisting but I simply cannot get this to execute.

TIA, HF

rasbt commented 8 years ago

Hm, sorry for the trouble ... I just copy & pasted my code from above (the lower block) to a fresh Jupyter notebook and it works fine without any issues. I can't remember that there was a SFS upgrade that happened between your version and the latest dev version, but maybe it would be worthwhile upgrading just in case?

Isn't the k_features parameter of SequentialFeatureSelector used to indicate the maximum number of final features allowed? The documentation states "Number of features to select, where k_features < the full feature set."

Yes, that's right. Actually, it's the exact number of features. If k_features=2, it will return exactly 2 features in the pipeline (not "up to 2 features", just to clarify :) )

I had assumed that I need only have one metric parameter, but in your code above you use both seuclidean and sfs_seuclidean. Why is this? Note that I am using the final knn to do the refit so as to get the best model for later use (something akin to one of your last examples in the SFS page).

Yeah, I was just lazy and simply uncommented the latter (the former, seuclidean, is probably redundant then). You can also run it as

param_grid = [
    #seuclidean,
    sfs_seuclidean
  ]

(Just checked, it works fine as well)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    2.5s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   10.8s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:   23.4s
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed:   28.8s finished
gs.best_params_
{'sfs__estimator__algorithm': 'brute',
 'sfs__estimator__leaf_size': 5,
 'sfs__estimator__metric': 'seuclidean',
 'sfs__estimator__metric_params': {'V': array([ 0.685564,  0.167216,  3.070371,  0.601184])},
 'sfs__estimator__n_neighbors': 1,
 'sfs__estimator__weights': 'uniform',
 'sfs__k_features': 3}

You say that I need to set k_features equal to the length of the variance vector. But this does not seem to make sense. As the SFS searches for the subset of variables used in the model, shouldn't it also select the appropriate (corresponding) elements of the variance vector V?

Hm, sorry, yeah, I got a bit confused about that part ... sorry, I only looked at the code very briefly. I think the error had something to do with the fact that in

seuclidean = {
    'knn__metric'       : ['seuclidean'],
    'knn__metric_params': [ {'V':V} ],  # will be automatically calculated
    #'knn__algorithm'    : ['ball_tree'] #,  # TODO , ['brute', 'ball_tree'],
     'knn__algorithm'    : ['brute'], #,  # TODO , ['brute', 'ball_tree'],
     'knn__n_neighbors'  : list(k_range),
     'knn__weights'      : weights,
     'knn__leaf_size'    : list(leaf_sizes) 
}

the k_features length was fixed to 3 (because this grid didn't have the sfs parameters to select from), while the variance was a 4-dimensional vector.

So, maybe one more thing worth noting is that there are 2 different knn's here. The one inside the SFS and the one outside the SFS. Or in other words, one KNN estimator used for feature selection, and one used for the final classification. For example, if I modify the parameter grid as follows (see the last line):

sfs_seuclidean = {
    'sfs__k_features'              : list(range(1, X.shape[1])),
    'sfs__estimator__metric'       : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ],  # will be automatically calculated
    #'sfs__estimator__algorithm'    : ['ball_tree'] #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__algorithm'    : ['brute'], #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__n_neighbors'  : list(k_range),
     'sfs__estimator__weights'      : weights,
     'sfs__estimator__leaf_size'    : list(leaf_sizes),
     'knn__n_neighbors'             : [1, 2, 3, 4],
}

you'll get the following:

gs = gs.fit(X_train, y_train)
Fitting 5 folds for each of 432 candidates, totalling 2160 fits
[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    3.1s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   12.6s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:   26.9s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:   45.3s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:  1.1min
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:  1.7min
[Parallel(n_jobs=1)]: Done 2160 out of 2160 | elapsed:  2.0min finished
gs.best_params_
{'knn__n_neighbors': 4,
 'sfs__estimator__algorithm': 'brute',
 'sfs__estimator__leaf_size': 5,
 'sfs__estimator__metric': 'seuclidean',
 'sfs__estimator__metric_params': {'V': array([ 0.685564,  0.167216,  3.070371,  0.601184])},
 'sfs__estimator__n_neighbors': 1,
 'sfs__estimator__weights': 'uniform',
 'sfs__k_features': 3}
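The candidate counts reported in this thread can be reproduced by multiplying out the grids. The sketch below is only an arithmetic check using the standard library; it copies the value lists from the grids above, assumes the 4-feature iris data, and leaves out `metric_params` because it contributes a single value. The helper `count_candidates` is a made-up name, not part of scikit-learn.

```python
from itertools import product

# Value lists copied from the parameter grids in this thread (iris: 4 features)
n_features = 4
k_range = list(range(1, n_features))      # [1, 2, 3]
weights = ['uniform', 'distance']
leaf_sizes = list(range(5, 60, 10))       # [5, 15, 25, 35, 45, 55]

def count_candidates(grid):
    """Number of parameter combinations in one grid (a dict of lists)."""
    return sum(1 for _ in product(*grid.values()))

sfs_grid = {
    'sfs__k_features': list(range(1, n_features)),
    'sfs__estimator__metric': ['seuclidean'],
    'sfs__estimator__algorithm': ['brute'],
    'sfs__estimator__n_neighbors': k_range,
    'sfs__estimator__weights': weights,
    'sfs__estimator__leaf_size': leaf_sizes,
}
knn_grid = {
    'knn__metric': ['seuclidean'],
    'knn__algorithm': ['brute'],
    'knn__n_neighbors': k_range,
    'knn__weights': weights,
    'knn__leaf_size': leaf_sizes,
}
# Same as sfs_grid, plus four settings for the final classifier
extended_grid = {**sfs_grid, 'knn__n_neighbors': [1, 2, 3, 4]}

cv_folds = 5
sfs_only = count_candidates(sfs_grid)                 # 3 * 3 * 2 * 6
both_grids = sfs_only + count_candidates(knn_grid)    # grids in a list add up
extended = count_candidates(extended_grid)            # sfs_only * 4
total_fits = extended * cv_folds
```

A list of grids is searched grid by grid, so the counts add; parameters inside one grid multiply, which is why one extra key with four values quadruples the number of fits.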

EDIT

I just saw that I did indeed make an update to the SFS in May, and in this case it's a crucial one! (https://github.com/rasbt/mlxtend/commit/84a90ab9929b311127968fef4467c7b5198780f7)

Now, the SFS clones the estimator by default, which I think is the better, recommended default behavior. In your older version, the same knn object instance is used both inside and outside the SFS, which is probably responsible for the bug that you are having. I bet that it will work fine if you update the mlxtend version. Sorry for the trouble here!
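The effect of sharing versus cloning can be illustrated with a toy sketch. The classes below are hypothetical stand-ins written for this illustration, not mlxtend or scikit-learn code, and `copy.deepcopy` is used here only as a rough substitute for cloning an estimator.

```python
import copy

class ToyEstimator:
    """Hypothetical stand-in for an estimator; not a scikit-learn class."""
    def __init__(self, metric='minkowski', metric_params=None):
        self.metric = metric
        self.metric_params = metric_params

class ToySelector:
    """Hypothetical stand-in for a feature selector wrapping an estimator."""
    def __init__(self, estimator, clone_estimator=True):
        # Cloning gives the selector its own private copy of the estimator
        self.estimator = copy.deepcopy(estimator) if clone_estimator else estimator

# Shared instance: parameters set on the inner estimator leak to the outer one
knn = ToyEstimator()
shared = ToySelector(knn, clone_estimator=False)
shared.estimator.metric = 'seuclidean'
outer_metric_when_shared = knn.metric

# Cloned instance: the outer estimator keeps its own settings
knn = ToyEstimator()
cloned = ToySelector(knn, clone_estimator=True)
cloned.estimator.metric = 'seuclidean'
outer_metric_when_cloned = knn.metric
```

With a shared instance, a grid entry such as `sfs__estimator__metric_params` would also reach the final classifier in the pipeline, which sees a different number of features than V was computed for.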

hmf commented 8 years ago

Hm, sorry for the trouble ...

Please don't apologise - I appreciate you taking the time to help.

I just see that I made an update to the SFS in May indeed, in this case, it's a crucial one!

I have installed the dev version and it is working! Thank you.

I would just like to clear up some doubts I still have. Sorry, but I am a little slow on the uptake.

Isn't the k_features parameter of SequentialFeatureSelector used to indicate the maximum number of final features allowed? The documentation states "Number of features to select, where k_features < the full feature set."

Yes, that's right. Actually, it's the exact number of features. If k_features=2, it will return exactly 2 features in the pipeline (not "up to 2 features", just to clarify :) )

I used the code above and kept k_features = 4 in the SFS parameters. The result I get is:

{'sfs__estimator__algorithm': 'brute',
 'sfs__estimator__leaf_size': 5,
 'sfs__estimator__metric': 'seuclidean',
 'sfs__estimator__metric_params': {'V': array([ 0.685564,  0.167216,  3.070371,  0.601184])},
 'sfs__estimator__n_neighbors': 1,
 'sfs__estimator__weights': 'uniform',
 'sfs__k_features': 3}

Notice how 'sfs__k_features': 3. In fact I can set the SFS k_features parameter to any value (say 6) because what counts is 'sfs__k_features' : list(range(1, X.shape[1])), as these are the values of k_features that will be used by SFS in the grid search. SFS selects the best subset of features of a given size, and GridSearchCV selects the best subset size (in this case 3 features). Is this correct?

So, maybe one more thing worth noting is that there are 2 different knn's here. The one inside the SFS and the one outside the SFS. Or in other words, one KNN estimator used for feature selection, and one used for the final classification. For example, if I modify the parameter grid as follows (see the last line):

sfs_seuclidean = {
    'sfs__k_features'              : list(range(1, X.shape[1])),
    'sfs__estimator__metric'       : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ],  # will be automatically calculated
    #'sfs__estimator__algorithm'    : ['ball_tree'] #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__algorithm'    : ['brute'], #,  # TODO , ['brute', 'ball_tree'],
     'sfs__estimator__n_neighbors'  : list(k_range),
     'sfs__estimator__weights'      : weights,
     'sfs__estimator__leaf_size'    : list(leaf_sizes),
     'knn__n_neighbors'             : [1, 2, 3, 4],
}

So are you saying here that for each of the SFS's 540 fits the last knn will be executed for 4 different neighbour sizes (4x540 = 2160)? Doesn't the last knn simply use the best neighbour size from the 'sfs__estimator__n_neighbors' : list(k_range) list? Isn't it simply repeating the same calculations?

Once again thanks for the help.

rasbt commented 8 years ago

Notice how 'sfs__k_features': 3. In fact I can set the SFS k_features parameter to any value (say 6) because what counts is 'sfs__k_features' : list(range(1, X.shape[1])), as these are the values of k_features that will be used by SFS in the grid search. SFS selects the best subset of features of a given size, and GridSearchCV selects the best subset size (in this case 3 features). Is this correct?

Yes! But prior to that, i.e., in your old version, you encountered this V-related dimensionality issue because you pre-computed the variance based on 4 features, then you used 3 features to feed it to the algo. Now, it doesn't matter anymore how you seed the k_features since you modify them during the grid search anyway; the k_features = 4 becomes just a "placeholder".
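The mismatch can be reproduced with a plain-Python version of the standardized Euclidean distance. This is a sketch of the formula only, not SciPy's implementation; the two samples and the selected feature indices are made up, and the variances are the rounded values printed earlier in this thread.

```python
import math

def seuclidean(u, v, variances):
    """Standardized Euclidean distance: sqrt(sum((u_i - v_i)^2 / V_i))."""
    if not (len(u) == len(v) == len(variances)):
        raise ValueError('V must have one entry per feature')
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(u, v, variances)))

# Variances as printed in this thread (4 iris features)
V = [0.685564, 0.167216, 3.070371, 0.601184]

# Made-up samples and a made-up selection of 3 of the 4 features
a = [5.1, 3.5, 1.4, 0.2]
b = [6.2, 2.9, 4.3, 1.3]
selected = (0, 2, 3)

a_sub = [a[i] for i in selected]
b_sub = [b[i] for i in selected]

try:
    seuclidean(a_sub, b_sub, V)        # 3 features against 4 variances
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Subsetting V with the same indices restores the match
V_sub = [V[i] for i in selected]
distance = seuclidean(a_sub, b_sub, V_sub)
```

A precomputed V is only valid for the exact set of columns it was computed from, so any step that drops columns has to drop the corresponding entries of V as well.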

So are you saying here that for each of the SFS's 540 fits the last knn will be executed for 4 different neighbour sizes (4x540 = 2160)? Doesn't the last knn simply use the best neighbour size from the 'sfs__estimator__n_neighbors' : list(k_range) list? Isn't it simply repeating the same calculations?

Yeah, that sounds correct. So basically, it works like this now:

  1. Initialize the pipeline with: MinMaxScaler(), SFS with its own KNN, and a different KNN classifier
    1. put the training fold through the MinMaxScaler
    2. select k_features via SFS using its KNN as "evaluator" with sfs__estimator__n_neighbors
    3. feed the best k_features to the separate KNN classifier with knn__n_neighbors
  2. Go back to step 1.1 for the next training fold and parameter combination
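The division of labour between the selector and the grid search can be sketched in plain Python. This is a toy greedy forward selection, not mlxtend's implementation: `forward_select`, `toy_score`, and the per-feature gains are all invented for illustration, with `toy_score` standing in for a cross-validated accuracy.

```python
def forward_select(n_features, k, score):
    """Greedy forward selection: grow the subset one feature at a time."""
    subset = ()
    while len(subset) < k:
        remaining = [f for f in range(n_features) if f not in subset]
        # Keep the candidate whose addition gives the highest score
        best = max(remaining, key=lambda f: score(tuple(sorted(subset + (f,)))))
        subset = tuple(sorted(subset + (best,)))
    return subset

# Hypothetical per-feature contributions: features 2 and 3 help a lot,
# feature 0 helps slightly, feature 1 hurts
gains = {0: 0.02, 1: -0.05, 2: 0.50, 3: 0.40}

def toy_score(subset):
    return sum(gains[f] for f in subset)

# Inner loop (the selector): best subset for a *given* size k
subsets = {k: forward_select(4, k, toy_score) for k in (1, 2, 3)}

# Outer loop (the grid search): pick the size whose subset scores best
best_k = max(subsets, key=lambda k: toy_score(subsets[k]))
```

The selector always returns exactly k features; it is the outer search over several values of k that decides how many features end up in the final model.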

Or maybe let's say we have a LogisticRegression classifier in the SFS to make things clearer:

sfs1 = SFS(estimator=LogisticRegression(),
           k_features=4,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
                 ('sfs', sfs1),
                 ('knn', knn)])
  1. Initialize the pipeline with: MinMaxScaler(), SFS with LogisticRegression, and the KNN classifier
    1. put the training fold through the MinMaxScaler
    2. select k_features via SFS using LogisticRegression with, e.g., a specific regularization strength
    3. feed the best k_features to the KNN classifier with n_neighbors
  2. Go back to step 1.1 for the next training fold and parameter combination

Now, you can use the same classifier instance for both selection and classification, e.g.,


lr = LogisticRegression()

sfs1 = SFS(estimator=lr,
           clone=False,
           k_features=4,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
                 ('sfs', sfs1),
                 ('clf', lr)])

However, in your case the problem is then that the V dimensions from the SFS don't match those of the classifier in the pipeline, and vice versa.

hmf commented 8 years ago

The explanation above is clear, thank you. However, from your last sentence I conclude that I cannot trust the result I am getting because the V variance is not being used correctly. I did a little digging and found the source:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py

which led me to:

https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py

It seems that if I don't provide a V, one will be calculated automatically.

 elif mstr in ['seuclidean', 'se', 's']:
            X = _convert_to_double(X)
            if V is not None:
                V = np.asarray(V, order='c')
                if V.dtype != np.double:
                    raise TypeError('Variance vector V must contain doubles.')
                if len(V.shape) != 1:
                    raise ValueError('Variance vector V must '
                                     'be one-dimensional.')
                if V.shape[0] != n:
                    raise ValueError('Variance vector V must be of the same '
                            'dimension as the vectors on which the distances '
                            'are computed.')
                # The C code doesn't do striding.
                VV = _copy_array_if_base_present(_convert_to_double(V))
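Why an automatically derived V sidesteps the dimension problem can be sketched as follows. This assumes, as the excerpt above suggests, that the metric computes V from whatever data it is handed when none is supplied; the use of the sample variance, the data, and the feature subset are all assumptions made for this illustration.

```python
import statistics

def column_variances(rows):
    """Per-column sample variance of a list of equal-length rows."""
    return [statistics.variance(col) for col in zip(*rows)]

# Made-up data: 4 samples, 4 features
X = [
    [5.1, 3.5, 1.4, 0.2],
    [6.2, 2.9, 4.3, 1.3],
    [5.9, 3.0, 5.1, 1.8],
    [4.7, 3.2, 1.3, 0.2],
]

selected = (0, 2, 3)   # hypothetical feature subset under evaluation
X_sub = [[row[i] for i in selected] for row in X]

V_full = column_variances(X)       # one entry per original feature
V_sub = column_variances(X_sub)    # one entry per selected feature
```

Because V is derived from the columns actually passed in, its length follows the selected subset automatically, whatever subset size the search is currently trying.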

So I redid the test with the brute and ball_tree algorithms (separately) without setting the V variable. This worked:

Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    1.6s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    6.7s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:   15.3s
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed:   18.4s finished

and produced the same results:

{'sfs__estimator__algorithm': 'ball_tree',
 'sfs__estimator__leaf_size': 5,
 'sfs__estimator__metric': 'seuclidean',
 'sfs__estimator__n_neighbors': 1,
 'sfs__estimator__weights': 'uniform',
 'sfs__k_features': 3}

Strangely enough I recall trying this before but it failed. I cannot remember what I did differently. Anyway problem solved, I have the solution.

Thank you for your help.

EDIT: maybe my initial tests, where I did not provide a V, failed because I was using an earlier version of mlxtend.