Dr. Raschka, thank you for all the wonderful work! Truly amazing library!
I have a question regarding SFS and categorical features. Since such features will have more than one column after we transform them, it makes intuitive sense to me that these encoded columns should always be selected together. SFS should not return, say, only 2 of the 3 encoded columns. Would you please let me know what your opinion is on this matter?
Yes, I agree with you. The problem is that this is more of a technical limitation (i.e., how can you encode the information of which columns belong together?).
Maybe the most general solution would be to perform the transformation from categorical to one-hot encoded features after the selection step? I.e., your classifier itself could be a scikit-learn pipeline where the first element is a ColumnTransformer that expands the categorical feature into a one-hot encoded feature set.
I think this should work, and maybe we could add an example to the documentation.
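To make that concrete, a minimal sketch of such a pipeline might look like the following (the column name 'categorical' and the KNN classifier are illustrative assumptions; as the rest of this thread shows, making this cooperate with SFS takes some extra care):
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Expand the categorical column(s) inside the estimator itself, so the
# selection step still sees one column per original feature.
expand = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['categorical'])],
    remainder='passthrough')
clf = make_pipeline(expand, KNeighborsClassifier())
# clf would then be passed to SequentialFeatureSelector as the estimator.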
Thank you for your quick response Dr. Raschka.
One casual way to work around this issue that I can think of, in case we really want to consider one-hot encoded features during SFS, is to simply label the columns that are generated after the transformation. I guess this process can be automated since these columns correspond to the unique values within a categorical feature.
Yes, that would be one solution. However, the results would be different.
E.g., let's assume we have three features A, B, and C, where A is a categorical feature with 3 possible values (0, 1, 2). Let's call the corresponding one-hot feature columns A_0, A_1, and A_2.
If I say select 2 features on the original DataFrame, it could select A, B or A, C and so forth. On the one-hot encoded DataFrame the selection can be different... E.g., the outcome could be A_0, A_1.
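(For illustration, here is a small snippet, not from the original discussion, showing the expansion with pd.get_dummies; a selector working on the expanded frame is free to split the encoded group:)
import pandas as pd

# Toy frame with a categorical column A (values 0, 1, 2) and numeric B, C.
df = pd.DataFrame({'A': [0, 1, 2, 1],
                   'B': [0.5, 0.1, 0.9, 0.3],
                   'C': [1.0, 2.0, 3.0, 4.0]})
expanded = pd.get_dummies(df, columns=['A'])
print(expanded.columns.tolist())
# ['B', 'C', 'A_0', 'A_1', 'A_2'] -- selecting 2 columns here could return
# only A_0 and A_1, splitting the one-hot group.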
So, instead of transforming A into a one-hot representation and doing the feature selection on it, a pipeline could be used. For example, the classifier for the feature selector could be a pipeline with elements [onehot -> classifier]. So, the feature selector would still consider the features as-is but would do the one-hot encoding only temporarily. E.g.,
I totally agree. My approach is way sloppier than yours. Thank you for your suggestion!
I actually tried that the other day with a scikit-learn Pipeline, OneHotEncoder, and ColumnTransformer. The technical limitation with using NumPy arrays is that we can't rely on column indices because they may refer to different features when we extend / shrink the subsets. One solution would be to use column names via pandas DataFrames. However, here the limitation is that while the SFS currently accepts DataFrames, it internally converts them to numpy arrays -- so the column name advantage is lost.
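(A tiny illustration of the index problem, added for clarity and not part of the original comment:)
import numpy as np

X = np.arange(12).reshape(3, 4)   # pretend column 3 is the categorical feature
subset = X[:, [0, 2, 3]]          # an SFS-style candidate subset
# Inside the subset, the categorical feature now sits at position 2, not 3,
# so a transformer hard-coded to "column 3" would encode the wrong feature
# (or fail if that column was dropped from the subset entirely).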
However, with #506, this could potentially be addressed! :)
So, what I had in mind the other day was something like this
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


class GetDummies():

    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        return pd.get_dummies(pd.DataFrame(X), columns=self.columns)

    def fit(self, X, y=None):
        return self


iris = load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=4)

y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])
X_df['categorical'] = y.astype(float)

######
from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe,
           k_features=3,
           forward=True,
           floating=False,
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)
Currently, this doesn't work, unfortunately, since the SFS passes around numpy arrays (and only keeps track of pandas feature column names internally).
Yes, I agree. Currently, SFS converts the whole feature matrix X, if it is a DataFrame object, into a NumPy array and treats the columns independently. And I genuinely was not aware of pd.get_dummies. Quite neat indeed!
Hi Dr. Raschka,
I am using feature selection on a dataset with categorical variables, and I came across this thread. I saw that you said this issue may be addressed with #506. Is this working now? The sample code you wrote above is exactly what I need, if I were able to call the features by column name. Have you figured out a way to work around it?
Hi there.
Unfortunately, I don't think #506 fully addressed this issue, so this is not supported yet.
This limitation is really difficult to discover: I had created a Pipeline with a OneHotEncoder and tried to use SequentialFeatureSelector on the whole pipeline, and the error messages were very unhelpful.
In the meantime, before this is fully supported, could the error messages be improved at all?
I would like to add more descriptive error messages. It's just hard to come up with a good rule here to catch errors related to the above-mentioned one-hot encoding approach.
Actually, thinking about this again, I believe the previously proposed pipeline approach works if we tweak it a little bit. The solution I proposed above would not work because the column designated for one-hot encoding might not be present due to the feature selection; the transformer then attempts to transform a missing column and crashes.
I believe this can be easily fixed by (1) checking which columns are actually candidates for one-hot encoding via a set intersection, and then (2) encoding only those columns that are present in the current iteration.
I believe the following should work:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


class GetDummies():

    def __init__(self, columns):
        self.columns = set(columns)

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        intersect = self.columns.intersection(df.columns)
        return pd.get_dummies(pd.DataFrame(X), columns=intersect)

    def fit(self, X, y=None):
        return self


iris = load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=4)

y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])
X_df['categorical'] = y.astype(float)

######
from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe,
           k_features=3,
           forward=True,
           floating=False,
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)
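(If the fit goes through, the selected columns can be inspected via the usual SFS attributes; this follow-up is illustrative and not part of the original snippet:)
print(sfs1.k_feature_names_)   # names of the selected DataFrame columns
print(sfs1.k_score_)           # cross-validation score of that subset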
Please let me know if this solves your use case. If yes, I am happy to add it to the documentation.
Hi, I have tried the code above; however, it doesn't work in my environment. The versions of the packages I installed are as follows:
I find that df.columns in the transform() method doesn't contain the original column names. Instead, df.columns is a RangeIndex object, so the dummy columns are not generated as expected.
class GetDummies():
    # ......

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        # type(df.columns) == pandas.core.indexes.range.RangeIndex
        intersect = self.columns.intersection(df.columns)
        return pd.get_dummies(pd.DataFrame(X), columns=intersect)

    # ......
In my case, the nominal attributes are of type string. Based on the scikit-learn tutorial, I rewrote the code to encode the nominal attributes using one-hot encoding.
import pandas as pd

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector


class DfConverter():

    def __init__(self):
        super().__init__()

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        # automatically determine the data type for each column
        df = df.convert_dtypes()
        return df

    def fit(self, X, y=None):
        return self


def get_pipeline(model_provider):
    df_converter = DfConverter()
    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    preprocessor = ColumnTransformer(transformers=[
        # one-hot encode the nominal (string) attributes
        ('dynamic_cat', categorical_transformer, make_column_selector(dtype_include="string"))
    ], remainder="passthrough")
    clf = Pipeline(steps=[
        ('df_converter', df_converter),
        ('preprocessor', preprocessor),
        ('classifier', model_provider())
    ])
    return clf


# usage
## loading data
df_train = pd.read_csv("./data/train.csv")
df_test = pd.read_csv("./data/test.csv")

selected_features = [ ...... ]
x_train = df_train[selected_features]
x_test = df_test[selected_features]
y_train = df_train["label"]
y_test = df_test["label"]

## training
knn = get_pipeline(lambda: KNeighborsClassifier(n_jobs=-1))
sfs1 = SequentialFeatureSelector(knn,
                                 k_features=3,
                                 forward=True,
                                 floating=False,
                                 scoring='accuracy',
                                 cv=4)
sfs1 = sfs1.fit(x_train, y_train)
Hope this helps.
Thanks for sharing! This would be another nice addition for the tutorials.
@rasbt So, I have been working on a dataset that has 190 features, and each feature itself is a time series (like 3D tabular data, where rows are samples, columns are features, and the depth is the time dimension). So, I converted each time series into 12 features (for some reasons, I had to do feature engineering). Now, I have $190 \times 12$ features. I am thinking of doing feature selection, but considering group feature selection.
Can't we do features_group similar to what we did in the Exhaustive Feature Selector? I mean, it should solve the one-hot-encoding problem mentioned at the top of this issue.
So, for instance, if my groups are [[0, 1, 2], [3], [4, 5]], then my high_level_indices is [0, 1, 2]. So, I can iterate through high-level indices only. Does that work?
We might say that if this parameter is provided, then non-float types are not supported (?!). So, the user should take care of the preprocessing in this case.
UPDATE: I checked out a recent PR, and it seems @rasbt has the idea that this task might be doable (see PR #957).
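(To sketch how group-aware forward selection could work in principle, here is a rough, illustrative implementation; the function name and arguments are assumptions and this is not mlxtend's actual code:)
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def forward_select_groups(estimator, X, y, groups, n_groups, cv=5):
    # Each entry of `groups` is a list of column indices that must be
    # added or skipped together (e.g., the one-hot columns of one feature).
    remaining = list(range(len(groups)))   # "high-level" group indices
    selected = []
    while len(selected) < n_groups and remaining:
        best_score, best_g = -np.inf, None
        for g in remaining:
            cols = [c for gi in selected + [g] for c in groups[gi]]
            score = cross_val_score(clone(estimator), X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_score, best_g = score, g
        selected.append(best_g)
        remaining.remove(best_g)
    return [groups[g] for g in selected]

# e.g., groups = [[0, 1, 2], [3], [4, 5]] keeps columns 0-2 together.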
Hey @NimaSarajpoor, you are right, the sequential feature selector could (/should) eventually have feature group support similar to the exhaustive feature selector. It's something I was hoping to tackle eventually some time this year when time permits (I was holding off on a new release until I get to this, because it would be nice to have a release that rolls this feature out for all three: sequential feature selector, exhaustive feature selector, and feature importance permutations). The code base of the sequential feature selector is a bit more complicated. Personally, I am also traveling for the next two weeks and will likely be on mobile only. If someone in this thread is interested in tackling this, that would be awesome, of course :)
Cool. I can definitely work on this. I might be a little bit slow due to my current workload but I hope I can get it done in a reasonable time.
Thanks, and no worries about the timeline at all! There are currently so many things to catch up with.
@rasbt You may want to close this.
Good call