scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Another input needed for the parameter `n_features_to_select` in SequentialFeatureSelector #21291

Open hellojinwoo opened 2 years ago

hellojinwoo commented 2 years ago

Describe the workflow you want to enable

Currently, to use the SequentialFeatureSelector, you need to pass the parameter `n_features_to_select`. However, according to the book 'Introduction to Statistical Learning', you can only know how many variables are appropriate after you have tested every number of predictors and obtained the best model for each size.

[Image: excerpt from ISLR]

This is an excerpt from the book ISLR, which shows that you can only determine the appropriate number of features after testing every number of predictors; you cannot determine it beforehand.

Describe your proposed solution

I suggest adding an option such as "best adjusted R²" for the parameter `n_features_to_select`. That way, you can choose the number of features with the highest adjusted R², which cannot be known beforehand.
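For context, a minimal sketch of how this can be approximated with the current API (the dataset, estimator, and scoring here are only illustrative): fit one `SequentialFeatureSelector` per candidate subset size, score each selected subset with cross-validation, and keep the size with the best mean score.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
estimator = LinearRegression()

# Fit one selector per candidate subset size and score the selected features.
scores = {}
for k in range(1, X.shape[1]):
    sfs = SequentialFeatureSelector(estimator, n_features_to_select=k).fit(X, y)
    scores[k] = cross_val_score(estimator, sfs.transform(X), y, cv=5).mean()

best_k = max(scores, key=scores.get)  # size with the highest mean CV score (R^2 here)
print(best_k, scores[best_k])
```

This works today but refits the whole forward pass for every candidate size, which is part of why a built-in option would be convenient.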

Describe alternatives you've considered, if relevant

No response

Additional context

No response

bmreiniger commented 2 years ago

Posted at https://datascience.stackexchange.com/q/102964/55122

See also https://github.com/scikit-learn/scikit-learn/issues/20137, https://github.com/scikit-learn/scikit-learn/issues/19583.

thomasjpfan commented 2 years ago

Closing because this is a duplicate of #20137. Note that there is ongoing work at https://github.com/scikit-learn/scikit-learn/pull/20145 that will resolve the issue.

bmreiniger commented 2 years ago

@thomasjpfan this would be a little different if the scores-vs-number-of-features graph isn't convex: this proposal would order all of the features and then choose the best number, which might not be at the first turnaround. That said, it's not clear how common it would be that later additions/deletions would be sufficiently better to be worth the extra processing time.

With the work in #20145, a user could get this by setting tol=-np.inf and then manually setting the number of features (would changing the `n_features_to_select_` learned attribute have the desired effect when transforming?). I'd imagine, if it were deemed worth it, we could add an option `n_features_to_select="best"` without too much trouble after #20145.
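For illustration, here is a rough hand-written sketch of the "order all of the features, then choose the best cut-off" idea, since the current `SequentialFeatureSelector` does not expose the order in which features were added; `forward_order` is a hypothetical helper, not part of the scikit-learn API, and the dataset and estimator are placeholders.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_order(estimator, X, y, cv=5):
    """Greedy forward selection that records the order in which features are added."""
    remaining = list(range(X.shape[1]))
    order, path_scores = [], []
    while remaining:
        best_score, best_j = -np.inf, None
        for j in remaining:
            cols = order + [j]
            score = cross_val_score(clone(estimator), X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_score, best_j = score, j
        order.append(best_j)
        remaining.remove(best_j)
        path_scores.append(best_score)
    return order, path_scores

X, y = load_diabetes(return_X_y=True)
order, path_scores = forward_order(LinearRegression(), X, y)
# Pick the best size over the whole path, not just at the first turnaround.
best_k = int(np.argmax(path_scores)) + 1
print(order[:best_k], path_scores[best_k - 1])
```

As noted above, this pays for the full forward pass even when the score curve flattens early, so the extra processing time is the trade-off.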

hellojinwoo commented 2 years ago

@bmreiniger As you said, since this proposal is about computing the estimated test score (e.g. adjusted R², AIC, BIC, etc.) for every number of features and only then deciding how many to use, it is not a duplicate of #20137. #20137 proposes something like "keep selecting features as long as AIC improves". This proposal, by contrast, is based on the premise that you should not rely on a single estimated test score to decide how many features to select. Instead, feature selection continues as long as the training SSE decreases, until all features have been added, and the number of features is then chosen by comparing several estimated test scores.
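For reference, a minimal sketch of the criteria mentioned above for an ordinary least squares fit, using the adjusted R² formula from ISLR and a common simplified form of AIC/BIC for Gaussian errors (additive constants dropped); `selection_criteria` is only an illustrative helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def selection_criteria(X, y):
    """Adjusted R^2, AIC and BIC for an OLS fit on the given feature subset."""
    n, d = X.shape
    model = LinearRegression().fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    sigma2 = rss / n  # maximum-likelihood estimate of the noise variance
    aic = n * np.log(sigma2) + 2 * (d + 1)
    bic = n * np.log(sigma2) + np.log(n) * (d + 1)
    return adj_r2, aic, bic
```

Computing these for every subset size gives the kind of table the book bases the final choice on: pick the size with the highest adjusted R² or the lowest AIC/BIC.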

R supports this functionality, so it would be great to see it in scikit-learn as well :)