rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Sequential Feature Selection for categorical features without one-hot encoding #502

Closed polishtits closed 2 years ago

polishtits commented 5 years ago

Dr. Raschka, Thank you for all the wonderful work! Truly amazing library!

I have a question regarding SFS and categorical features. Since such features will have more than one column after we transform them, it makes intuitive sense to me that these encoded columns should always be selected together. SFS should not return, say, only 2 of the 3 encoded columns. Would you please let me know what your opinion is on this matter?

rasbt commented 5 years ago

I have a question regarding SFS and categorical features. Since such features will have more than one column after we transform them, it makes intuitive sense to me that these encoded columns should always be selected together.

Yes, I agree with you. The problem is that this is more of a technical limitation (i.e., how can you encode the information of which columns belong together?).

Maybe the most general solution would be to perform the transformation from categorical to onehot encoded features after the selection step? I.e., your classifier itself could be a scikit-learn pipeline where the first element is a ColumnTransformer that expands the categorical feature into a onehot encoded feature set.

I think this should work, and maybe we could add an example to the documentation.

polishtits commented 5 years ago

Thank you for your quick response Dr. Raschka.

One casual way to work around this issue that I can think of, in case we really want to consider onehot encoded features during SFS, is to simply label these columns that are generated after transformation. I guess this process can be automated since these columns are related to the unique values within a categorical feature.
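
For instance, a minimal sketch (toy DataFrame with hypothetical column names) of how the encoded columns could be labeled automatically from the unique category values:

import pandas as pd

# toy frame: A is categorical with values 0/1/2, B is numeric
df = pd.DataFrame({'A': [0, 1, 2, 1], 'B': [5.0, 3.2, 1.1, 0.4]})

# pd.get_dummies names each new column after the original feature plus the
# category value, so the grouping stays recoverable from the prefix
encoded = pd.get_dummies(df, columns=['A'])
print(list(encoded.columns))  # ['B', 'A_0', 'A_1', 'A_2']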

rasbt commented 5 years ago

Yes, that would be one solution. However, the results would be different.

E.g., let's assume we have three features A, B, and C, where A is a categorical feature with 3 possible values (0, 1, 2). Let's call the corresponding onehot feature columns A_0, A_1, A_2.

If I, say, select 2 features on the original DataFrame, it could select A, B or A, C, and so forth. On the onehot encoded DataFrame the selection can be different ... e.g., the outcome could be A_0, A_1.

So, instead of transforming A into a onehot representation and doing the feature selection on it, a pipeline could be used. For example, the classifier for the feature selector could be a pipeline with elements [onehot -> classifier]. So, the feature selector would still consider the original features but would only apply the onehot encoding temporarily.

polishtits commented 5 years ago

I totally agree. My approach is way sloppier than yours. Thank you for your suggestion!

rasbt commented 5 years ago

I actually tried that the other day with a scikit-learn Pipeline, OneHotEncoder, and ColumnTransformer. The technical limitation with using NumPy arrays is that we can't rely on column indices because they may refer to different features when we extend / shrink the subsets. One solution would be to use column names via pandas DataFrames. However, here the limitation is that while the SFS currently accepts DataFrames, it internally converts them to numpy arrays -- so the column name advantage is lost.
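
A minimal standalone illustration of that index shift (toy NumPy array, hypothetical values):

import numpy as np

X = np.array([[1., 10., 100.],
              [2., 20., 200.]])

# take a feature subset consisting of the original features 0 and 2
subset = X[:, [0, 2]]

# inside the subset, column index 1 now refers to the original feature 2,
# so a transformer hard-wired to "column 1" would hit the wrong feature
print(subset[:, 1])  # [100. 200.]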

However, with #506, this could potentially be addressed! :)

rasbt commented 5 years ago

So, what I had in mind the other day was something like this

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Minimal duck-typed transformer: one-hot encodes the given columns via pd.get_dummies
class GetDummies():

    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        return pd.get_dummies(pd.DataFrame(X), columns=self.columns)

    def fit(self, X, y=None):
        return self

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)

y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])

X_df['categorical'] = y.astype(float)

######

from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)

Currently, this doesn't work, unfortunately, since the SFS passes around numpy arrays (and only keeps track of pandas feature column names internally)

polishtits commented 5 years ago

Yes, I agree. Currently, SFS converts the whole feature matrix X, if it is a DataFrame object, into a NumPy array and treats the columns independently. And I genuinely was not aware of pd.get_dummies. Quite neat indeed!

mckennapep commented 3 years ago

Hi Dr. Raschka,

I am using feature selection on a dataset with categorical variables, and I came across this thread. I saw that you said this issue may be addressed with #506. Is this working now? The sample code you wrote above is exactly what I need, if I were able to call the features by column name. Have you figured out a way to work around it?

rasbt commented 3 years ago

Hi there.

Unfortunately, I don't think #506 fully addressed this issue, so this is not supported yet.

mengyujackson121 commented 3 years ago

This limitation is really difficult to find out about: I had created a Pipeline with a OneHotEncoder and tried to use SequentialFeatureSelector on the whole pipeline, and the error messages were very unhelpful.

In the meantime, before this is fully supported, could the error messages be improved at all?

rasbt commented 3 years ago

I would like to add more descriptive error messages. It's just hard to come up with a good rule here to catch errors related to the above-mentioned one-hot encoding approach.

Actually, thinking about this again, I believe the previously proposed pipeline approach actually works if we tweak it a little bit. The solution I proposed above would not work because a column intended for one-hot encoding might not be present due to the feature selection, in which case the transformer attempts to transform a non-existent column and crashes.

I believe this can be easily fixed by (1) checking which columns are actually candidates for one-hot encoding via a set intersection, and then (2) encoding only those columns that are present in the current iteration.

I believe the following should work:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Minimal duck-typed transformer: one-hot encodes only those of the given
# columns that are actually present in the current feature subset
class GetDummies():

    def __init__(self, columns):
        self.columns = set(columns)

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        # only encode the categorical columns present in this feature subset
        intersect = self.columns.intersection(df.columns)

        return pd.get_dummies(df, columns=list(intersect))

    def fit(self, X, y=None):
        return self

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)

y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])

X_df['categorical'] = y.astype(float)

######

from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)

Please let me know if this solves your use case. If yes, I am happy to add it to the documentation.

twbrandon7 commented 3 years ago

I would like to add more descriptive error messages. It's just hard to come up with a good rule here to catch errors related to the above-mentioned one-hot encoding approach. ......

Hi, I have tried the code above; however, it doesn't work in my environment. The versions of the packages I installed are as follows:

I found that df.columns in the transform() method doesn't contain the original column names. Instead, df.columns is a RangeIndex object, so the dummy columns aren't generated as expected.

class GetDummies():

    # ......

    def transform(self, X, y=None):
        df = pd.DataFrame(X)

        # type(df.columns) == pandas.core.indexes.range.RangeIndex
        intersect = self.columns.intersection(df.columns)

        return pd.get_dummies(pd.DataFrame(X), columns=intersect)

    # ......
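
A minimal standalone check of that behavior (toy array; assuming the 'categorical' column name from the earlier example):

import numpy as np
import pandas as pd

arr = np.array([[5.1, 0.0],
                [4.9, 1.0]])

# once SFS has converted the DataFrame to a NumPy array, the names are gone
df = pd.DataFrame(arr)
print(df.columns)                                # RangeIndex(start=0, stop=2, step=1)
print({'categorical'}.intersection(df.columns))  # set() -> no column gets encoded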

In my case, the nominal attributes are of type string. Based on the tutorial from scikit-learn, I rewrote the code to encode the nominal attributes using one-hot encoding.

import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector
from sklearn.neighbors import KNeighborsClassifier

# Converts the NumPy array that SFS passes around back into a DataFrame with
# inferred dtypes, so the dtype-based column selector below keeps working
class DfConverter():
    def __init__(self):
        super().__init__()

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)

        # automatically infer the data type of each column
        df = df.convert_dtypes()        

        return df

    def fit(self, X, y=None):
        return self

def get_pipeline(model_provider):
    df_converter = DfConverter()

    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    preprocessor = ColumnTransformer(transformers=[
        # one-hot encode the nominal (string) attributes
        ('dynamic_cat', categorical_transformer, make_column_selector(dtype_include="string"))
    ], remainder="passthrough")

    clf = Pipeline(steps=[
        ('df_converter', df_converter),
        ('preprocessor', preprocessor),
        ('classifier', model_provider())
    ])

    return clf

# usage

## loading data
df_train = pd.read_csv("./data/train.csv")
df_test = pd.read_csv("./data/test.csv")

selected_features = [ ...... ]
x_train = df_train[selected_features]
x_test = df_test[selected_features]
y_train = df_train["label"]
y_test = df_test["label"]

## training
knn = get_pipeline(lambda: KNeighborsClassifier(n_jobs=-1))
sfs1 = SequentialFeatureSelector(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=4)
sfs1 = sfs1.fit(x_train, y_train)

Hope this helps. πŸ˜€

rasbt commented 3 years ago

Thanks for sharing! This would be another nice addition for the tutorials.

NimaSarajpoor commented 2 years ago

@rasbt So, I have been working on a dataset that has 190 features, and each feature itself is a time series (like 3D tabular data, where rows are samples, columns are features, and the depth is the time dimension). So, I converted each time series into 12 features (for some reasons, I had to do feature engineering). Now, I have $190 \times 12$ features. I am thinking of doing feature selection, but considering group feature selection.

Can't we do feature_groups similar to what we did in the Exhaustive Feature Selector? I mean, it should solve the one-hot encoding problem mentioned at the top of this issue.

So, for instance, if my groups are [[0, 1, 2], [3], [4, 5]], then my high_level_indices are [0, 1, 2]. So, I can iterate through the high-level indices only, as sketched below. Does that work?
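
A minimal sketch (plain Python, hypothetical variable names) of that mapping: the selector would iterate over group indices, and each chosen group expands back into its member columns before fitting the estimator.

groups = [[0, 1, 2], [3], [4, 5]]
high_level_indices = list(range(len(groups)))  # [0, 1, 2]

# suppose the selector picked groups 0 and 2 in some iteration
selected_groups = [0, 2]

# expand the chosen groups into the flat column indices used for fitting
selected_columns = [col for g in selected_groups for col in groups[g]]
print(selected_columns)  # [0, 1, 2, 4, 5]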


We might say that if this parameter is provided, then non-float types are not supported (?!). So, the user should take care of preprocessing in this case.


UPDATE: I checked out a recent PR, and it seems @rasbt has the idea that this task might be doable (see PR #957).

rasbt commented 2 years ago

Hey @NimaSarajpoor, you are right, the sequential feature selector could (/should) eventually have feature group support similar to the exhaustive feature selector. It's something I was hoping to tackle eventually some time this year when time permits (I was holding off on a new release until I get to this, because it would be nice to have a release that rolls this feature out for all three: the sequential feature selector, the exhaustive feature selector, and feature importance permutations). The code base of the sequential feature selector is a bit more complicated. Personally, I am also traveling the next two weeks and will likely be on mobile only. If someone in this thread is interested in tackling this, that would be awesome, of course :)

NimaSarajpoor commented 2 years ago

Cool. I can definitely work on this. I might be a little bit slow due to my current workload but I hope I can get it done in a reasonable time.

rasbt commented 2 years ago

Thanks and no worries about the timeline at all! Currently so many things to catch up with πŸ˜…

NimaSarajpoor commented 2 years ago

@rasbt You may want to close this.

rasbt commented 2 years ago

Good call