apptimise opened this issue 4 years ago
I understand that the RFECV uses CV only to determine the best k, not the set of features. The attributes returned are therefore correct, and reporting averages would be inappropriate when features have interactions (e.g. collinearity). Providing the rankings produced by each split's RFE would provide additional information about the stability of the model, but not a better model.
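For illustration, a minimal sketch (with placeholder data) of that behaviour: the attributes `RFECV` reports should match a manual `RFE` refit on the full training set with the CV-selected number of features.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")

# CV is used to pick the number of features...
rfecv = RFECV(estimator, step=1, cv=3).fit(X, y)

# ...and the reported attributes come from a final RFE refit on the whole set
rfe = RFE(estimator, n_features_to_select=rfecv.n_features_, step=1).fit(X, y)
print((rfecv.support_ == rfe.support_).all())
print((rfecv.ranking_ == rfe.ranking_).all())
```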
Thanks.
I understand that the RFECV uses CV only to determine the best k, not the set of features.
That's why I think providing the selected set of features or their rankings from the whole set can be confusing (because that's not what the algorithm does/is intended for).
Also, let's consider a voting-based approach. Why would it be wrong (even if collinearity exists)? Consider 3 folds with the following `ranking_`s:
[1 2 3 4]
[2 1 4 3]
[1 3 2 4]
The sum is: [4 6 9 11]; therefore, the "true" `ranking_` can be considered to be [1 2 3 4].
Now suppose that you fit on the whole training set and get something like [2 1 4 3]
(similar to fold 2). All the ranks would then differ from what could be considered the "average"/voting-based ranking.
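As a minimal sketch of this rank-summing idea (the per-fold rankings are the hypothetical ones above; `argsort` of the summed ranks is just one way to turn the votes back into a ranking):

```python
import numpy as np

# hypothetical per-fold RFE ranking_ vectors from the example above
fold_rankings = np.array([
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [1, 3, 2, 4],
])

votes = fold_rankings.sum(axis=0)  # summed ranks: [4 6 9 11]
# convert summed ranks back into a 1-based ranking (lower sum = better rank)
voted_ranking = votes.argsort().argsort() + 1
print(votes)          # [ 4  6  9 11]
print(voted_ranking)  # [1 2 3 4]
```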
What if the first two features are identical but for random noise? They are also the most important feature. Because the two are redundant, my splits might rank:
[1, 4, 2, 3]
[4, 1, 2, 3]
[1, 4, 2, 3]
[4, 1, 2, 3]
Sum these...
[10, 10, 8, 12]
Not really useful.
RFE is very explicitly an alternative to univariate selection, one that takes this conditional dependence into account.
And grid search, which is effectively what is happening here, usually works by learning hyperparameters under CV, but fitting the final model parameters on the full training set.
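For comparison, a minimal sketch of that grid-search pattern (placeholder data and grid): the hyperparameter is chosen by CV, but the returned model is refit on the whole training set.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# C is selected via cross-validation...
search = GridSearchCV(SVR(kernel="linear"), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)

# ...but best_estimator_ is refit on the full training set (refit=True by default)
print(search.best_params_)
print(search.best_estimator_)
```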
@jnothman I'm still not clear why it's not useful. The final rank translates to [2, 2, 1, 3]
which shows how features behaved across different folds. That said, all of these are hypothetical examples, so I'm trying to see if I can generate some data to prove my point.
BTW, do you have a reference on why/how `RFE` would help in situations such as multicollinearity?
Ok, here is an example where the features selected by a voting-based approach can lead to a better `r2` score, compared with the features selected by the current `RFECV` approach. I still use `RFECV` to decide about the number of features to choose, but then, instead of fitting it on the whole set, I reduce `X` using the voting vector, and then compute the score:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
n_folds = 5
k_fold = KFold(n_splits=n_folds, shuffle=True, random_state=42)
rng = np.random.RandomState(seed=0)
X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
estimator = SVR(kernel="linear")
rfecv = RFECV(estimator, step=1, scoring="r2", cv=k_fold)
rfecv = rfecv.fit(X, y)
print("RFECV support_", rfecv.support_)
print("RFECV ranking_", rfecv.ranking_)
print("RFECV grid_scores_", rfecv.grid_scores_)
print("RFECV score", "%.4f" % rfecv.score(X, y), "\n\n")
# voting based
from sklearn.feature_selection import RFE
from sklearn import metrics
X_train, y_train = X.copy(), y.copy()
n_select = rfecv.n_features_
print("features to select: ", n_select)
rfe_pipe = Pipeline(
[
("rfe_select", RFE(estimator=estimator, n_features_to_select=n_select, step=1)),
("eval", estimator),
]
)
manual_cv_scores = []
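# accumulate each fold's RFE ranking_; a lower summed rank means the feature
# was ranked better across folds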
votes = np.zeros(X.shape[1])
fold_count = 0
for train_ind, validation_ind in k_fold.split(X_train):
fold_count += 1
Xtr, Xval = X_train[train_ind, :], X_train[validation_ind, :]
ytr, yval = y_train[train_ind], y_train[validation_ind]
rfe_pipe.fit(Xtr, ytr)
print(
"Fold ",
fold_count,
"Selected Features: ",
rfe_pipe["rfe_select"].ranking_,
)
votes = votes + rfe_pipe["rfe_select"].ranking_
y_pred = rfe_pipe.predict(Xval)
manual_cv_scores.append(metrics.r2_score(yval, y_pred))
print("Votes:", votes)
print("Manual CV, rfe_pipe, scores: ", ["%.6f" % r2 for r2 in manual_cv_scores])
print(
"Mean CV (equals to",
rfecv.n_features_,
"th item in RFECV grid_scores_)",
"%.6f" % np.mean(manual_cv_scores),
)
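# refit on the n_select features with the lowest summed ranks (the "voting"
# selection) and score on the same data, mirroring rfecv.score(X, y) above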
estimator.fit(X[:, votes.argsort()[:n_select]], y)
ypred = estimator.predict(X[:, votes.argsort()[:n_select]])
print("voting_based score:", "%.4f" % metrics.r2_score(y, ypred))
RFECV score: 0.6748
voting_based score: 0.6947
It doesn't matter how features behave across different folds. It matters that the final model has the most informative features and minimum model size.
Ok, this keeps coming back to me. Now, imagine we would like to use grid search across folds for model selection (hyperparameter tuning) or for checking overfitting. I guess we can agree that the differences across folds need to be taken into account. I just want to know how you would address the problem when performing a grid search.
I guess using `RFE` in a `Pipeline` and then performing grid search on the pipe will not result in an apples-to-apples comparison, because again, in each fold, the selected features might be different. So what would you do?
What I can think of is to do `RFECV` first, get the selected features on the whole training set, and then create a custom feature selector in my pipeline which selects those features. This way, the selected features in the grid search will be the same across folds. To summarize (a rough sketch follows the outline below):
Aim: model selection using grid search or comparing train-test scores using cross validation
RFECV -> select the best features for the whole training set
Create custom_feature_selector to select the above features
pipe: preprocessor - custom_feature_selector - predictor
GridSearchCV(pipe,...)
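A rough sketch of that workflow, where a `FunctionTransformer` stands in for the hypothetical custom_feature_selector and the data, preprocessor, and grid are placeholders:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)

# 1) RFECV on the whole training set to get one fixed feature subset
rfecv = RFECV(SVR(kernel="linear"), step=1, scoring="r2", cv=5).fit(X, y)
selected = np.flatnonzero(rfecv.support_)

# 2) a "custom feature selector" that always picks those columns,
#    so every grid-search fold sees the same features
select_fixed = FunctionTransformer(lambda X: X[:, selected])

# 3) pipe: preprocessor - custom_feature_selector - predictor
pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("custom_feature_selector", select_fixed),
    ("predictor", SVR(kernel="linear")),
])

# 4) grid search over the predictor's hyperparameters
grid = GridSearchCV(pipe, {"predictor__C": [0.1, 1, 10]}, cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_, "%.4f" % grid.best_score_)
```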
What do you think?
Answering the discussion in https://github.com/scikit-learn/scikit-learn/discussions/20976, I came to this issue.
I think that I am in line with @jnothman and this particular comment: https://github.com/scikit-learn/scikit-learn/issues/17782#issuecomment-651502519
Now that we have a `cv_results_` dictionary, we could store the support for each split. It might serve to understand whether the selection is more or less stable, with all the limitations that implies.
Regarding the behaviour of the current estimator, I think that it is fine to refit a model on the optimal `k`.
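Until such a `cv_results_` entry exists, a minimal sketch of collecting per-split support manually (placeholder data; the per-fold `RFE` simply reuses the `n_features_to_select` that `RFECV` already found):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
estimator = SVR(kernel="linear")
cv = KFold(n_splits=5, shuffle=True, random_state=42)

rfecv = RFECV(estimator, step=1, scoring="r2", cv=cv).fit(X, y)

# support_ for each split, i.e. the kind of information that could live in cv_results_
split_support = []
for train_idx, _ in cv.split(X, y):
    rfe = RFE(estimator, n_features_to_select=rfecv.n_features_, step=1)
    rfe.fit(X[train_idx], y[train_idx])
    split_support.append(rfe.support_)

# how often each feature was selected across splits (a rough stability measure)
print(np.mean(split_support, axis=0))
```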
It would be awesome to have the support for each split! Is it planned anytime in the future?
I'm not aware of anyone working on it but it looks like a good enhancement to be done. @MarieS-WiMLDS do you want to implement this feature?
Describe the workflow you want to enable
Currently, `RFECV` searches for the optimal number of features to provide the best score. It then fits an `RFE` on the whole training set (L566). So, the score is the cross-validated score, but the `ranking_` is for the whole set of data. Isn't this misleading? E.g. with `RFE` and 3-fold cross-validation, I get these `ranking_`s:

while `RFECV` returns something like:

which is the `ranking_` that `RFE` would provide if fit over the whole set. So, the score is a cross-validated score, while the `ranking_` and `support_` are not.
Describe your proposed solution
Imagine a situation where you get the best performance with `k` features, but the `k` features in some folds are different from the `k` features you would get by fitting the model on the whole training set. Wouldn't it be a better idea to return a form of average/voting-based `ranking_` and `support_`?
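As a minimal sketch of one possible voting-based aggregation (hypothetical per-fold `support_` vectors combined by majority vote; this is not how `RFECV` behaves today, and it does not fix the number of selected features in advance):

```python
import numpy as np

# hypothetical per-fold support_ vectors (True = feature kept in that fold)
fold_support = np.array([
    [True,  False, True,  True,  False],
    [True,  True,  True,  False, False],
    [True,  False, True,  True,  False],
])

# keep a feature if it was selected in the majority of folds
voted_support = fold_support.mean(axis=0) >= 0.5
print(voted_support)  # [ True False  True  True False]
```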