rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Let SequentialFeatureSelector within GridSearchCV evaluate on same test split #829

Open ptoews opened 3 years ago

ptoews commented 3 years ago

Describe the workflow you want to enable

I'm using SFS within GridSearchCV, and as far as I understand it, GridSearchCV splits the data into train and test folds, fits the inner SFS on the train fold, and scores the fitted estimator on the test fold. The inner SFS therefore trains and evaluates on the train fold only, since it never sees the (outer) test fold. I think it would be better if the same test fold could be used for the SFS evaluation as well, especially when the dataset is small and further splitting makes training more difficult.
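
To make the setup concrete, here is a minimal sketch of the nesting I mean; the dataset, estimator, and parameter values are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier()
sfs = SFS(estimator=knn, k_features=2)

# GridSearchCV splits X into train/test folds; the SFS step only ever
# receives the train fold, so its internal feature scoring cannot use
# the outer test fold.
pipe = Pipeline([('sfs', sfs), ('knn', knn)])
param_grid = {'sfs__k_features': [1, 2, 3]}

gs = GridSearchCV(pipe, param_grid, cv=5)
gs.fit(X, y)
print(gs.best_params_)
```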

Describe your proposed solution

My current solution is very hacky: a custom scoring function that ignores the training data it is given and instead evaluates on the global test dataset, with the indices parameter (which describes the current feature subset) passed through to the scoring function in this call: https://github.com/rasbt/mlxtend/blob/2945485168744bbd254378aeda73e2d34ee19024/mlxtend/feature_selection/sequential_feature_selector.py#L38

I couldn't come up with a cleaner approach so far, but the above works and improves generalization performance significantly in my case.
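
For illustration, a rough sketch of such a scorer; `fixed_test_scorer`, its `indices` keyword, and the train/test variable names are hypothetical, and forwarding `indices` only works with the patch to the internal call linked above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Fixed outer test split, held in module scope so the scorer can reach it
# (the "global test dataset" mentioned above).
X, y = load_iris(return_X_y=True)
X_train, X_test_global, y_train, y_test_global = train_test_split(
    X, y, test_size=0.3, random_state=0)

def fixed_test_scorer(estimator, X_unused, y_unused, indices=None):
    # SFS fits `estimator` on a column subset of the training fold, so the
    # same columns must be selected from the fixed test set. Forwarding
    # `indices` (the current feature subset) requires the patch described
    # above; stock mlxtend calls scorers without it.
    X_eval = X_test_global[:, indices] if indices is not None else X_test_global
    return accuracy_score(y_test_global, estimator.predict(X_eval))
```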

rasbt commented 3 years ago

Hi there,

I can see how that can be a limitation in grid search. Given that the current SFS is already relatively complicated and has perhaps too many bells and whistles for deviating from scikit-learn's expected use, I am wondering if your solution could maybe be an example we add to the documentation rather than another option in the parameter set?