apptimise opened this issue 4 years ago
I understand that the RFECV uses CV only to determine the best k, not the set of features. The attributes returned are therefore correct, and reporting averages would be inappropriate when features have interactions (e.g. collinearity). Providing the rankings produced by each split's RFE would provide additional information about the stability of the model, but not a better model.
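For illustration, a minimal sketch (with placeholder data) of that behaviour: the attributes `RFECV` reports should match a manual `RFE` refit on the full training set with the CV-selected number of features.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")

# CV is used to pick the number of features...
rfecv = RFECV(estimator, step=1, cv=3).fit(X, y)

# ...and the reported attributes come from a final RFE refit on the whole set
rfe = RFE(estimator, n_features_to_select=rfecv.n_features_, step=1).fit(X, y)
print((rfecv.support_ == rfe.support_).all())
print((rfecv.ranking_ == rfe.ranking_).all())
```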
Thanks.
I understand that the RFECV uses CV only to determine the best k, not the set of features.
That's why I think providing the selected set of features or their rankings from the whole set can be confusing (because that's not what the algorithm does/is intended for).
Also, let's consider a voting-based approach. Why would it be wrong (even if collinearity exists)? Consider 3 folds with the following `ranking_`s:
[1 2 3 4]
[2 1 4 3]
[1 3 2 4]
The sum is: [4 6 9 11]; therefore, the "true" `ranking_` can be considered to be [1 2 3 4].
Now suppose that you fit on the whole training set and get something like [2 1 4 3]
(similar to fold 2). All the ranks would then differ from what could be considered the "average"/voting-based ranking.
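As a minimal sketch of this rank-summing idea (the per-fold rankings are the hypothetical ones above; `argsort` of the summed ranks is just one way to turn the votes back into a ranking):

```python
import numpy as np

# hypothetical per-fold RFE ranking_ vectors from the example above
fold_rankings = np.array([
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [1, 3, 2, 4],
])

votes = fold_rankings.sum(axis=0)  # summed ranks: [4 6 9 11]
# convert summed ranks back into a 1-based ranking (lower sum = better rank)
voted_ranking = votes.argsort().argsort() + 1
print(votes)          # [ 4  6  9 11]
print(voted_ranking)  # [1 2 3 4]
```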
What if the first two features are identical but for random noise? They are also the most important feature. Because the two are redundant, my splits might rank:
[1, 4, 2, 3]
[4, 1, 2, 3]
[1, 4, 2, 3]
[4, 1, 2, 3]
Sum these...
[10, 10, 8, 12]
Not really useful.
RFE is very explicitly an alternative to univariate selection, one that takes this conditional dependence into account.
And grid search, which is effectively what is happening here, usually works by learning hyperparameters under CV, but fitting the final model parameters on the full training set.
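For comparison, a minimal sketch of that grid-search pattern (placeholder data and grid): the hyperparameter is chosen by CV, but the returned model is refit on the whole training set.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# C is selected via cross-validation...
search = GridSearchCV(SVR(kernel="linear"), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)

# ...but best_estimator_ is refit on the full training set (refit=True by default)
print(search.best_params_)
print(search.best_estimator_)
```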
@jnothman I'm still not clear why it's not useful. The final rank translates to [2, 2, 1, 3]
which shows how features behaved across different folds. That said, all of these are hypothetical examples, so I'm trying to see if I can generate some data to prove my point.
BTW, do you have a reference on why/how `RFE` would help in situations such as multicollinearity?
Ok, here is an example where the features selected by a voting-based approach can lead to a better `r2` score, compared with the features selected by the current `RFECV` approach. I still use `RFECV` to decide about the number of features to choose, but then, instead of fitting it on the whole set, I reduce `X` using the voting vector, and then compute the score:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
n_folds = 5
k_fold = KFold(n_splits=n_folds, shuffle=True, random_state=42)
rng = np.random.RandomState(seed=0)
X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
estimator = SVR(kernel="linear")
rfecv = RFECV(estimator, step=1, scoring="r2", cv=k_fold)
rfecv = rfecv.fit(X, y)
print("RFECV support_", rfecv.support_)
print("RFECV ranking_", rfecv.ranking_)
print("RFECV grid_scores_", rfecv.grid_scores_)
print("RFECV score", "%.4f" % rfecv.score(X, y), "\n\n")
# voting based
from sklearn.feature_selection import RFE
from sklearn import metrics
X_train, y_train = X.copy(), y.copy()
n_select = rfecv.n_features_
print("features to select: ", n_select)
rfe_pipe = Pipeline(
[
("rfe_select", RFE(estimator=estimator, n_features_to_select=n_select, step=1)),
("eval", estimator),
]
)
manual_cv_scores = []
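# accumulate each fold's RFE ranking_; a lower summed rank means the feature
# was ranked better across folds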
votes = np.zeros(X.shape[1])
fold_count = 0
for train_ind, validation_ind in k_fold.split(X_train):
fold_count += 1
Xtr, Xval = X_train[train_ind, :], X_train[validation_ind, :]
ytr, yval = y_train[train_ind], y_train[validation_ind]
rfe_pipe.fit(Xtr, ytr)
print(
"Fold ",
fold_count,
"Selected Features: ",
rfe_pipe["rfe_select"].ranking_,
)
votes = votes + rfe_pipe["rfe_select"].ranking_
y_pred = rfe_pipe.predict(Xval)
manual_cv_scores.append(metrics.r2_score(yval, y_pred))
print("Votes:", votes)
print("Manual CV, rfe_pipe, scores: ", ["%.6f" % r2 for r2 in manual_cv_scores])
print(
"Mean CV (equals to",
rfecv.n_features_,
"th item in RFECV grid_scores_)",
"%.6f" % np.mean(manual_cv_scores),
)
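# refit on the n_select features with the lowest summed ranks (the "voting"
# selection) and score on the same data, mirroring rfecv.score(X, y) above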
estimator.fit(X[:, votes.argsort()[:n_select]], y)
ypred = estimator.predict(X[:, votes.argsort()[:n_select]])
print("voting_based score:", "%.4f" % metrics.r2_score(y, ypred))
RFECV score: 0.6748
voting_based score: 0.6947
It doesn't matter how features behave across different folds. It matters that the final model has the most informative features and minimum model size.
Ok, this keeps coming back to me. Now, imagine we would like to use grid search across folds for model selection (hyperparameter tuning) or for checking overfitting. I guess we can agree that the differences across folds need to be taken into account. I just want to know how you would address the problem when performing a grid search.
I guess using `RFE` in a `Pipeline` and then performing grid search on the pipe will not result in an apples-to-apples comparison, because again, in each fold, the selected features might be different. So what would you do?
What I can think of is to do `RFECV` first, get the selected features on the whole training set, and then create a custom feature selector in my pipeline which selects those features. This way, the selected features in the grid search will be the same across folds. To summarize (a rough sketch follows the outline below):
Aim: model selection using grid search or comparing train-test scores using cross validation
RFECV -> select the best features for the whole training set
Create custom_feature_selector to select the above features
pipe: preprocessor - custom_feature_selector - predictor
GridSearchCV(pipe,...)
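A rough sketch of that workflow, where a `FunctionTransformer` stands in for the hypothetical custom_feature_selector and the data, preprocessor, and grid are placeholders:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)

# 1) RFECV on the whole training set to get one fixed feature subset
rfecv = RFECV(SVR(kernel="linear"), step=1, scoring="r2", cv=5).fit(X, y)
selected = np.flatnonzero(rfecv.support_)

# 2) a "custom feature selector" that always picks those columns,
#    so every grid-search fold sees the same features
select_fixed = FunctionTransformer(lambda X: X[:, selected])

# 3) pipe: preprocessor - custom_feature_selector - predictor
pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("custom_feature_selector", select_fixed),
    ("predictor", SVR(kernel="linear")),
])

# 4) grid search over the predictor's hyperparameters
grid = GridSearchCV(pipe, {"predictor__C": [0.1, 1, 10]}, cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_, "%.4f" % grid.best_score_)
```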
What do you think?
Answering the discussion in https://github.com/scikit-learn/scikit-learn/discussions/20976, I came to this issue.
I think that I am in line with @jnothman and this particular comment: https://github.com/scikit-learn/scikit-learn/issues/17782#issuecomment-651502519
Now that we have a `cv_results_` dictionary, we could store the support for each split. It might serve to understand whether the selection is more or less stable, with all the limitations that implies.
Regarding the behaviour of the current estimator, I think that it is fine to refit a model on the optimal `k`.
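Until such a `cv_results_` entry exists, a minimal sketch of collecting per-split support manually (placeholder data; the per-fold `RFE` simply reuses the `n_features_to_select` that `RFECV` already found):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
estimator = SVR(kernel="linear")
cv = KFold(n_splits=5, shuffle=True, random_state=42)

rfecv = RFECV(estimator, step=1, scoring="r2", cv=cv).fit(X, y)

# support_ for each split, i.e. the kind of information that could live in cv_results_
split_support = []
for train_idx, _ in cv.split(X, y):
    rfe = RFE(estimator, n_features_to_select=rfecv.n_features_, step=1)
    rfe.fit(X[train_idx], y[train_idx])
    split_support.append(rfe.support_)

# how often each feature was selected across splits (a rough stability measure)
print(np.mean(split_support, axis=0))
```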
It would be awesome to have the support for each split! Is it planned anytime in the future?
I'm not aware of anyone working on it but it looks like a good enhancement to be done. @MarieS-WiMLDS do you want to implement this feature?
Describe the workflow you want to enable
Currently, `RFECV` searches for the optimal number of features to provide the best score. It then fits an `RFE` on the whole training set (L566). So, the score is the cross-validated score, but the `ranking_` is for the whole set of data. Isn't this misleading? E.g. with `RFE` and 3-fold cross-validation, I get these `ranking_`s:

while `RFECV` returns something like:

which is the `ranking_` that `RFE` would provide if fit over the whole set. So, the score is a cross-validated score, while the `ranking_` and `support_` are not.
Describe your proposed solution
Imagine a situation where you get the best performance with `k` features, but the `k` features in some folds are different from the `k` features you would get by fitting the model on the whole training set. Wouldn't it be a better idea to return a form of average/voting-based `ranking_` and `support_`?
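As a minimal sketch of one possible voting-based aggregation (hypothetical per-fold `support_` vectors combined by majority vote; this is not how `RFECV` behaves today, and it does not fix the number of selected features in advance):

```python
import numpy as np

# hypothetical per-fold support_ vectors (True = feature kept in that fold)
fold_support = np.array([
    [True,  False, True,  True,  False],
    [True,  True,  True,  False, False],
    [True,  False, True,  True,  False],
])

# keep a feature if it was selected in the majority of folds
voted_support = fold_support.mean(axis=0) >= 0.5
print(voted_support)  # [ True False  True  True False]
```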