rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

SequentialFeatureSelection Early Stopping Criterion #886

Open · aldder opened 2 years ago

aldder commented 2 years ago

Description

According to several studies (e.g., https://www.researchgate.net/publication/220804144_Overfitting_in_Wrapper-Based_Feature_Subset_Selection_The_Harder_You_Try_the_Worse_it_Gets), wrapper-based feature selection methods suffer from overfitting as the number of explored states increases. One way to reduce this overfitting is an automatic stopping criterion (early stopping, as is well known from neural network training). In this PR I have implemented an early stopping criterion for the SequentialFeatureSelector class.

One parameter has been added to the constructor:

early_stop_rounds : int (default 0)
    Enables the early stopping criterion when > 0; this value determines
    the number of iterations after which execution is stopped if no
    performance improvement has been seen.
    Used only when `k_features == 'best'` or `k_features == 'parsimonious'`.
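
For intuition, the bookkeeping behind such a criterion can be sketched as follows. This is an illustrative sketch only; the made-up score list, variable names, and message below are assumptions, not the PR's actual implementation.

# Sketch of the `early_stop_rounds` bookkeeping (illustrative only).
candidate_scores = [0.80, 0.93, 0.95, 0.95, 0.94, 0.94]  # made-up CV score per subset size
early_stop_rounds = 2

best_score = float("-inf")
rounds_without_improvement = 0

for k, k_score in enumerate(candidate_scores, start=1):
    if k_score > best_score:
        best_score = k_score
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement >= early_stop_rounds:
            print(f"No improvement for {early_stop_rounds} rounds; "
                  f"stopping at subset size {k}.")
            break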

Code Example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

np.random.seed(0)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# add some noise in order to have features to discard
X_iris_with_noise = np.concatenate(
    (X_iris,
    np.random.randn(X_iris.shape[0], X_iris.shape[1])),
    axis=1)

knn = KNeighborsClassifier()
sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=0,
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

[Figure 1: SFS performance plot without early stopping]

sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=2,  # stop after 2 rounds without improvement
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

... Performances not improved for 2 rounds. Stopping now!

[Figure 2: SFS performance plot with early stopping after 2 rounds without improvement]


rasbt commented 2 years ago

Thanks for the PR! I agree that overfitting can become an issue. Currently, there is the `k_features='parsimonious'` option, which will select the smallest feature set within 1 standard error of the best feature set, which helps with this.
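
For reference, a minimal use of that existing option looks like this (the estimator is just a placeholder; fitting is omitted):

from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# 'parsimonious' selects the smallest feature subset whose cross-validation
# score is within one standard error of the best subset's score.
sfs = SFS(
    estimator=KNeighborsClassifier(),
    k_features='parsimonious',
    forward=True,
    cv=5)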

I like adding an early_stop option, but I have a few suggestions / concerns regarding the API:

1)

I think that the two parameters `early_stop` and `early_stop_rounds` can be consolidated into a single one. E.g.,


                 if self.early_stop and k != k_to_select:
                     if k_score <= best_score:

could be

                 if self.early_stop_rounds and k != k_to_select:
                     if k_score <= best_score:

What I mean is that, instead of having two separate parameters (a boolean `early_stop` flag plus an `early_stop_rounds` count), this could be simplified to a single `early_stop_rounds` parameter, with `0` meaning that early stopping is disabled.
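
In terms of the constructor call, the idea would roughly be the following (a sketch only; `early_stop` and `early_stop_rounds` are the PR's parameters, not part of released mlxtend, and `knn` refers to the estimator from the example above):

# Two parameters, as currently in the PR:
sfs = SFS(estimator=knn, k_features='best',
          early_stop=True, early_stop_rounds=3)

# Consolidated into one parameter (0 = early stopping disabled):
sfs = SFS(estimator=knn, k_features='best',
          early_stop_rounds=3)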

2)

The second concern I have is that if a user selects, e.g., `k_features=(1, 3)` together with `early_stop_rounds=3`, it is not necessarily guaranteed that a subset of 1 to 3 features will be selected, which can be confusing.

I wonder if it makes sense to allow `early_stop_rounds` only for `k_features='best'` and `k_features='parsimonious'`, which both explore the whole feature subset size space.

E.g., something along the lines of the sketch below.
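
A hypothetical version of such a guard (the function name, argument handling, and error message are assumptions, not code from the PR):

# Reject early stopping unless the whole subset-size space is explored.
def _check_early_stop_rounds(early_stop_rounds, k_features):
    if early_stop_rounds > 0 and k_features not in ('best', 'parsimonious'):
        raise ValueError(
            "early_stop_rounds is only supported for "
            "k_features='best' or k_features='parsimonious'")

_check_early_stop_rounds(early_stop_rounds=3, k_features='best')   # passes
_check_early_stop_rounds(early_stop_rounds=3, k_features=(1, 3))   # raises ValueError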

What are your thoughts?

aldder commented 2 years ago

Thanks for your suggestions; I agree with your points. I will update this PR with the following changes: consolidating `early_stop` and `early_stop_rounds` into the single `early_stop_rounds` parameter, and allowing it only for `k_features='best'` and `k_features='parsimonious'`.

rasbt commented 2 years ago

Sounds great! Looking forward to it!

pep8speaks commented 2 years ago

Hello @aldder! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2022-02-03 10:37:22 UTC
jimmy927 commented 10 months ago

What is the status of this?

I use `k_features='parsimonious'` for my model, but it keeps adding more and more features even after it is obvious the model will not improve, and in the end it selects one of the very early models anyway.

I think this PR could cut my runtime from 10 days down to hours ;-)

rasbt commented 10 months ago

Thanks for the ping. I need to look into this at some point -- sorry, I haven't had a chance recently due to too many other commitments. Unfortunately, I currently don't have a timeline for this.