scikit-learn-contrib / stability-selection

scikit-learn compatible implementation of stability selection.
BSD 3-Clause "New" or "Revised" License

ValueError: all the input array dimensions except for the concatenation axis must match exactly #29

Open goerch opened 4 years ago

goerch commented 4 years ago

The following cell from my notebook

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFpr, SelectFdr, SelectFwe
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MaxAbsScaler, RobustScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from stability_selection import StabilitySelection
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
vt = VarianceThreshold()
scl = RobustScaler(with_centering=False)
lr = LogisticRegression(random_state=42, solver='lbfgs', C=.05, max_iter=10000, n_jobs=-1, verbose=1)
pipe = Pipeline([('vt', vt), ('scl', scl), ('lr', lr)])
clf = StabilitySelection(base_estimator=pipe, lambda_name='lr__C', lambda_grid=np.array([.075, .05, .025]), n_bootstrap_iterations=3)
clf.fit(X_train, y_train)

gives the output

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   34.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   32.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   32.9s finished

and then the traceback

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-f62a92b4924d> in <module>
     38 clf = StabilitySelection(base_estimator=pipe, lambda_name='lr__C', lambda_grid=np.array([.075, .05, .025]), n_bootstrap_iterations=3)
     39 
---> 40 clf.fit(X_train, y_train)
     41 #print(lr.C_)
     42 print(pd.Series(features)[vt.get_support(indices=True)[sel.get_support(indices=True)[clf.get_support()]]].sort_values().tail(60))

/opt/conda/lib/python3.6/site-packages/stability_selection/stability_selection.py in fit(self, X, y)
    344               for subsample in bootstrap_samples)
    345 
--> 346             stability_scores[:, idx] = np.vstack(selected_variables).mean(axis=0)
    347 
    348         self.stability_scores_ = stability_scores

/opt/conda/lib/python3.6/site-packages/numpy/core/shape_base.py in vstack(tup)
    281     """
    282     _warn_for_nonsequence(tup)
--> 283     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    284 
    285 

ValueError: all the input array dimensions except for the concatenation axis must match exactly

X_train is a scipy csr_matrix. Is this expected behaviour?
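For reference, the error itself just means that the boolean masks passed to np.vstack have different lengths. A minimal sketch with made-up lengths (not the actual shapes from my data):

import numpy as np

# Made-up mask lengths, purely to illustrate the failure mode:
# np.vstack cannot stack 1-D boolean masks of different sizes.
masks = [np.zeros(100, dtype=bool), np.zeros(98, dtype=bool)]
np.vstack(masks)  # ValueError: all the input array dimensions ... must match exactly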

goerch commented 4 years ago

Monkey patching StabilitySelection with

from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.utils import check_array, check_random_state, check_X_y, safe_mask
from sklearn.externals.joblib import Parallel, delayed
from sklearn.feature_selection import SelectFromModel

def _return_estimator_from_pipeline(pipeline):
    """Returns the final estimator in a Pipeline, or the estimator
    if it is not"""
    if isinstance(pipeline, Pipeline):
        return pipeline._final_estimator
    else:
        return pipeline

def _bootstrap_generator(n_bootstrap_iterations, bootstrap_func, y,
                         n_subsamples, random_state=None):
    for _ in range(n_bootstrap_iterations):
        subsample = bootstrap_func(y, n_subsamples, random_state)
        if isinstance(subsample, tuple):
            for item in subsample:
                yield item
        else:
            yield subsample

def _fit_bootstrap_sample(base_estimator, X, y, lambda_name, lambda_value,
                          threshold=None):
    """
    Fits base_estimator on a bootstrap sample of the original data,
    and returns a mask of the variables that are selected by the fitted model.

    Parameters
    ----------
    base_estimator : Estimator
        Estimator to be fitted on the data

    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        The training input samples.

    y : array-like, shape = [n_samples]
        The target values.

    lambda_name : str
        Name of the penalization parameter of base_estimator

    lambda_value : float
        Value of the penalization parameter

    threshold : string or float, optional (default=None)
        The threshold value to use for feature selection. Features whose
        importance is greater or equal are kept while the others are
        discarded. If "median" (resp. "mean"), then the ``threshold`` value is
        the median (resp. the mean) of the feature importances. A scaling
        factor (e.g., "1.25*mean") may also be used. If None and if the
        estimator has a ``penalty`` parameter set to "l1", either explicitly
        or implicitly (e.g., Lasso), the threshold used is 1e-5.
        Otherwise, "mean" is used by default.

    Returns
    -------
    selected_variables : array-like, shape = [n_features]
        Boolean mask of selected variables.
    """

    base_estimator.set_params(**{lambda_name: lambda_value})
    base_estimator.fit(X, y)

    # TODO: Reconsider if we really want to use SelectFromModel here or not
    selector_model = _return_estimator_from_pipeline(base_estimator)
    variable_selector = SelectFromModel(estimator=selector_model,
                                        threshold=threshold,
                                        prefit=True)
    return variable_selector.get_support()

def fit(self, X, y):
    """Fit the stability selection model on the given data.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        The training input samples.

    y : array-like, shape = [n_samples]
        The target values.
    """

    self._validate_input()

    X, y = check_X_y(X, y, accept_sparse='csr')

    n_samples, n_variables = X.shape
    n_subsamples = np.floor(self.sample_fraction * n_samples).astype(int)
    n_lambdas = self.lambda_grid.shape[0]

    base_estimator = clone(self.base_estimator)
    random_state = check_random_state(self.random_state)
    stability_scores = np.zeros((n_variables, n_lambdas))

    for idx, lambda_value in enumerate(self.lambda_grid):
        if self.verbose > 0:
            print("Fitting estimator for lambda = %.5f (%d / %d) on %d bootstrap samples" %
                  (lambda_value, idx + 1, n_lambdas, self.n_bootstrap_iterations))

        bootstrap_samples = _bootstrap_generator(self.n_bootstrap_iterations,
                                                 self.bootstrap_func, y,
                                                 n_subsamples, random_state=random_state)

        selected_variables = Parallel(
            n_jobs=self.n_jobs, verbose=self.verbose,
            pre_dispatch=self.pre_dispatch
        )(delayed(_fit_bootstrap_sample)(clone(base_estimator),
                                         X=X[safe_mask(X, subsample), :],
                                         y=y[subsample],
                                         lambda_name=self.lambda_name,
                                         lambda_value=lambda_value,
                                         threshold=self.bootstrap_threshold)
          for subsample in bootstrap_samples)

        for i in range(0, self.n_bootstrap_iterations):
            print("Sample ", i, " selected ", selected_variables[i].shape)
        stability_scores[:, idx] = np.vstack(selected_variables).mean(axis=0)

    self.stability_scores_ = stability_scores
    return self

StabilitySelection.fit = fit

the debug output for the notebook is

Sample  0  selected  (30969,)
Sample  1  selected  (31011,)
Sample  2  selected  (30959,)

which indicates that _fit_bootstrap_sample returns masks of different lengths for different bootstrap samples, i.e. it doesn't interact correctly with the feature-selection step in the notebook pipeline.

thuijskens commented 4 years ago

Hi @goerch - thanks for raising this issue.

It is hard for me to debug this without a minimal reproducible example (as I don't have access to your data).

Looking at your code, I wonder if the VarianceThreshold step is the cause of the different shapes in the output of _fit_bootstrap_sample. Depending on the bootstrap sample, this transformer may drop different variables, which results in a different data matrix being passed to SelectFromModel. Perhaps you can log the shapes of the input and output data of VarianceThreshold and compare them to the shape of selected_variables.
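Something along these lines (a rough sketch, reusing pipe, X_train and y_train from your notebook) should show whether VarianceThreshold keeps a different number of features on different subsamples:

import numpy as np

# Rough sketch: refit the pipeline on a few random subsamples and compare how
# many features the 'vt' step keeps each time.
rng = np.random.RandomState(0)
n_samples = X_train.shape[0]
for i in range(3):
    subsample = rng.choice(n_samples, size=int(0.5 * n_samples), replace=False)
    pipe.fit(X_train[subsample], np.asarray(y_train)[subsample])
    n_kept = pipe.named_steps['vt'].get_support().sum()
    print("subsample", i, "input features:", X_train.shape[1], "kept by 'vt':", n_kept)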

goerch commented 4 years ago

Hi @thuijskens,

your assessment is correct. Commit https://github.com/goerch/stability-selection/commit/940086601627fc09bf8b9bfdf2ba397429faff97 contains a failing test case that reproduces the issue.

goerch commented 4 years ago

Hi @thuijskens,

commit https://github.com/goerch/stability-selection/commit/37ac9be160f53f8ce393ad6baa1496bbac6fcb2c fixes the failing test case. But I'm not certain whether it is even possible to support all the different sklearn pipeline configurations.
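For illustration, one possible way to support pipelines that contain feature-selecting transformers is to expand the mask produced for the final estimator back into the original feature space. A sketch only (the helper name is made up, and this is not necessarily what the commit above does):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel

def selected_variables_in_input_space(pipeline, threshold=None):
    # Build the mask from the final estimator, then expand it back through every
    # feature-selecting step (anything exposing get_support) so that its length
    # matches the number of input features of the pipeline.
    final = pipeline._final_estimator if isinstance(pipeline, Pipeline) else pipeline
    mask = SelectFromModel(estimator=final, threshold=threshold, prefit=True).get_support()
    if isinstance(pipeline, Pipeline):
        for _, step in reversed(pipeline.steps[:-1]):
            if hasattr(step, 'get_support'):
                expanded = np.zeros(step.get_support().shape[0], dtype=bool)
                expanded[step.get_support(indices=True)] = mask
                mask = expanded
    return mask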