rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Custom generator for models in exhaustive feature selector #833

Open jonathan-taylor opened 3 years ago

jonathan-taylor commented 3 years ago

Describe the workflow you want to enable

I'd like to make it easier to do best-subset selection with categorical features. For simplicity, let's start by assuming an additive model, so each feature is associated with a set of columns in the design matrix. When all features are continuous, each feature corresponds to a single column; otherwise there is a feature grouping, which can be described as a sequence of length X.shape[1] assigning each column to a particular feature. More generally, this sequence assigning columns to features could also encode interactions of both continuous and categorical variables.
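For concreteness, here is a minimal sketch of the grouping I have in mind (the variable names are illustrative, not an existing mlxtend API):

    import pandas as pd

    # One continuous feature and one 3-level categorical feature,
    # expanded into a design matrix via one-hot encoding.
    df = pd.DataFrame({"age": [23, 35, 41, 52],
                       "color": ["red", "green", "blue", "red"]})
    X = pd.get_dummies(df, columns=["color"], dtype=float).to_numpy()  # shape (4, 4)

    # One entry per column of X, assigning each column to a feature;
    # columns 1-3 all belong to the categorical feature "color".
    feature_groups = [0, 1, 1, 1]

    # Candidate subsets are then expressed in terms of groups, e.g. {0}, {1},
    # {0, 1}, and each group is expanded to its columns before fitting.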

Describe your proposed solution

It is (at least in some corners) common practice to include either all of the columns associated with a categorical feature or none of them. This constraint could be encoded in the candidates list. If interactions are permitted, some conventions only include an interaction when both main effects are also included. While the logic for which candidates to generate may be user-specific, it seems that if we could supply a custom iterator for candidates, most of the existing code would not need to be modified. Instead of custom_names, each candidate could carry its own identifier, so one could specify whether the iterator yields plain index tuples or (indices, identifier) pairs.
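As a sketch, a grouped candidate generator could yield (indices, identifier) pairs along these lines (hypothetical code, not part of mlxtend):

    from itertools import combinations

    def grouped_candidates(feature_groups, names=None):
        # Yield (column_indices, identifier) pairs where each candidate
        # includes either all columns of a group or none of them.
        group_ids = sorted(set(feature_groups))
        for r in range(1, len(group_ids) + 1):
            for subset in combinations(group_ids, r):
                cols = tuple(i for i, g in enumerate(feature_groups) if g in subset)
                label = subset if names is None else tuple(names[g] for g in subset)
                yield cols, label

    # With feature_groups = [0, 1, 1, 1] and names = ["age", "color"] this yields
    # ((0,), ("age",)), ((1, 2, 3), ("color",)), ((0, 1, 2, 3), ("age", "color")).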

This would also remove the need for the min_features/max_features arguments, since those constraints would be encoded in the iterator itself. Perhaps a few helper functions producing common candidate iterators could be included: specifically, one that reproduces the default "all continuous" iterator, and one that handles an additive model with some categorical variables.
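As a rough sketch, the current all-continuous behaviour (with min_features/max_features folded into the generator) could be reproduced by a helper like this (hypothetical name):

    from itertools import chain, combinations

    def all_subsets(n_features, min_features=1, max_features=None):
        # Default candidate generator: every column is its own feature, and
        # every subset between min_features and max_features is a candidate.
        max_features = n_features if max_features is None else max_features
        return chain.from_iterable(
            combinations(range(n_features), r)
            for r in range(min_features, max_features + 1))

    # list(all_subsets(3, max_features=2))
    # -> [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]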

Describe alternatives you've considered, if relevant

I've considered simply wrapping R functions like regsubsets, which handle categorical variables easily, but I would prefer an sklearn-aware version that can do the same.

Additional context

jonathan-taylor commented 3 years ago

Implemented a simple version here: https://github.com/rasbt/mlxtend/pull/834

It might also be nice for the sequential feature selector to accept custom logic as well. Again, when adding and deleting categorical variables or interactions, one would want to add or delete groups of columns at a time.
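As a toy sketch, a single group-wise forward step could generate candidates like this (again purely illustrative):

    def forward_step_candidates(current_groups, feature_groups):
        # Propose one candidate per group not yet selected, adding all of
        # that group's columns at once.
        for g in sorted(set(feature_groups) - set(current_groups)):
            chosen = set(current_groups) | {g}
            cols = tuple(i for i, grp in enumerate(feature_groups) if grp in chosen)
            yield cols, tuple(sorted(chosen))

    # With feature_groups = [0, 1, 1, 1] and current_groups = (0,), the only
    # candidate adds the whole categorical group: ((0, 1, 2, 3), (0, 1)).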

rasbt commented 3 years ago

Overall, this sounds like a great idea, and I would be in favor of such a solution for both the exhaustive and sequential feature selectors. Refactoring this into custom iterators seems like a very elegant solution. We could then provide a helper function that reproduces the current behavior, along with generators for datasets with categorical variables.

With regard to identifying categorical features, there are many options, but what do you think of adopting the approach used in scikit-learn's HistGradientBoostingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)?
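For reference, that estimator takes a categorical_features argument as a boolean mask or a list of column indices; a minimal usage example of that interface:

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.default_rng(0)
    # Column 0 is continuous; column 1 is categorical, encoded as small
    # non-negative integers as the estimator expects.
    X = np.column_stack([rng.normal(size=100), rng.integers(0, 3, size=100)])
    y = rng.integers(0, 2, size=100)

    clf = HistGradientBoostingClassifier(categorical_features=[1])
    clf.fit(X, y)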

Sorry I am currently moving and may not be super responsive in the next 1-2 weeks, but I just wanted to say that your proposal would be a very nice feature.

jonathan-taylor commented 3 years ago

Let me take a look at the scikit-learn example. On further reflection, it seems possible to handle both sequential and exhaustive selection with almost identical code that generates candidates from a current "state". For exhaustive search, the candidates would not depend on the state but would simply continue along a generator, while for sequential search the set of candidates would depend on an updated state. State updates could be applied by applying a function to the scores returned for the previous set of candidates; i.e., sequential's next state would be the maximizer from the previous set of candidates.
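A rough sketch of that shared driver loop, assuming the candidate generator is a callable of the current state (all names hypothetical):

    def select(X, y, next_candidates, score):
        # Generic driver shared by exhaustive and sequential search.
        # next_candidates(state) returns (columns, label) pairs, or an empty
        # iterable when there is nothing left to try; the maximizer of each
        # round becomes the next state.
        state, best = None, None
        while True:
            batch = [(score(X[:, list(cols)], y), cols, label)
                     for cols, label in next_candidates(state)]
            if not batch:
                break
            round_best = max(batch, key=lambda t: t[0])
            state = round_best[1]  # sequential: next state = round maximizer
            if best is None or round_best[0] > best[0]:
                best = round_best
        return best

    # Exhaustive: next_candidates ignores state and drains a single generator
    # (returning an empty batch once exhausted).  Sequential: next_candidates
    # proposes adding/removing one group relative to state.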

Getting both done this way may be too ambitious to start. I will try to flesh out the exhaustive one first...

rasbt commented 2 years ago

Thanks a lot for the PR, this is very exciting!

Big picture-wise, there are a few thoughts.

1) What do we do with the existing ExhaustiveFeatureSelector and SequentialFeatureSelector? We could deprecate them, that is, remove them from the documentation but leave them in the code for a few versions / years.

2) If we do deprecate the existing SFS, two missing features would be floating-forward and floating-backward. I think right now, via

    for direction in ['forward', 'backward', 'both']:
        strategy = step(X,
                        direction=direction,
                        max_features=p,
                        fixed_features=[2, 3],
                        categorical_features=categorical_features)

it only supports the standard forward and backward. I assume that 'both' means it runs forward first and finds the best set. Then it runs backward (independently) to find the best set. Then, the best set is determined by comparing the results from forward and backward? This is actually a neat addition.

3) If we deprecate and add the floating variants, I think the only thing that we need to ensure is that it still remains compatible with scikit-learn pipelines and maybe GridSearchCV.
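For point 3, the compatibility target would be the kind of usage that already works with the current SequentialFeatureSelector, roughly:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from mlxtend.feature_selection import SequentialFeatureSelector as SFS

    X, y = load_iris(return_X_y=True)
    knn = KNeighborsClassifier()

    pipe = Pipeline([
        ("sfs", SFS(knn, k_features=2, forward=True, floating=False, cv=3)),
        ("clf", knn),
    ])

    param_grid = {"sfs__k_features": [1, 2, 3],
                  "clf__n_neighbors": [3, 5]}
    GridSearchCV(pipe, param_grid, cv=3).fit(X, y)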

Amazing work, though. What you put together here is really exciting and impressive!