Recently, I found another PR ( #651 ) for adding fixed_features to ExhaustiveFeatureSelector. So, I think we are going to have a slight problem here. Let's say our features are [1, 2, 3, 4, 5] and the user wants to:

- consider feature_groups as [[1], [2], [3], [4, 5]]
- consider fixed features: [1, 4]

So... how should we deal with this? We can:

- throw a warning and consider [1, 4, 5] as fixed features? Or,
- throw an error and let the user know that there is a conflict between these two parameters? Or,
- ???

Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!

EDIT: We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.
Oh good call. To be honest, I think it might be best to check for conflicts and raise an error so the user can fix it. It's maybe better than having a silent behavior that is not clear to the user.
So in the case above, maybe an error message would make sense like
raise ValueError("Feature 4 was specified as a fixed feature but feature 4 has other features in its feature group.")
And in the docstring, we can maybe say something like
If both fixed_features and feature_groups are used, all feature group members of a fixed feature need to be specified as fixed features.
What do you think?
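A minimal sketch of what such a check could look like (the helper name and message are illustrative, not the actual mlxtend implementation):

def _check_fixed_features_in_groups(fixed_features, feature_groups):
    # A group that overlaps the fixed features must be fully contained in
    # them; otherwise the two parameters conflict.
    fixed = set(fixed_features)
    for group in feature_groups:
        members = set(group)
        if members & fixed and not members <= fixed:
            raise ValueError(
                f"Features {sorted(members & fixed)} were specified as fixed "
                f"features, but their feature group also contains "
                f"{sorted(members - fixed)}."
            )

try:
    _check_fixed_features_in_groups([1, 4], [[1], [2], [3], [4, 5]])
except ValueError as e:
    print(e)  # group [4, 5] is only partially covered by fixed_features=[1, 4]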
I think it might be best to check for conflicts and raise an error so the user can fix it. It's maybe better than having a silent behavior that is not clear to the user.
You are right... It would be safer. We are putting just a little load on the shoulders of users to take care of such conflicts. However, they will then have a better understanding of what's going on :)
Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!
Thanks for creating this opportunity :) I enjoy doing this :)
We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.
Sure.
I guess the next step would be unit tests. Would you like any help with it?
I will probably need some help :) For now, I am going to take a look at the existing test function(s) and see how things are. I will try to do it myself. If I run into any issues, I will let you know.
@rasbt Also, during the weekend I do not have access to my main PC that contains my local repo and my in-progress work. I use another PC on weekends and thus just review my code. So, sometimes you may notice that I review my own code and catch some issues (so I do not forget them later when I push the next set of commits).
We also need to add two notes to the docstring:
1. For min_features / max_features: when the parameter feature_groups is provided (and is not None), the number of features in fact corresponds to the number of feature groups. For instance, when feature_groups = [[0], [1], [2, 3], [4]], max_features can be at most 4, as we only have four groups.
2. Although two or more individual features may be considered as one group throughout the feature-selection process, it does not mean the individual features of that group have the same impact on the outcome. For instance, in linear regression, the coefficients of features 2 and 3 can be different even if they are considered as one group in feature_groups.
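A quick toy illustration of the second note (plain scikit-learn, independent of the PR code):

import numpy as np
from sklearn.linear_model import LinearRegression

# Even if features 2 and 3 were grouped during selection, their fitted
# coefficients can still differ.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 1.0 * X[:, 2] + 5.0 * X[:, 3] + rng.normal(scale=0.1, size=100)
model = LinearRegression().fit(X, y)
print(model.coef_[2], model.coef_[3])  # roughly 1.0 and 5.0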
I like the suggestion! Slightly more concise version for the first:
If parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e., len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], the max_features value cannot exceed 4.
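To make the bound concrete, here is a toy sketch assuming (as discussed above) that the exhaustive search enumerates combinations of groups rather than of individual columns:

from itertools import combinations

feature_groups = [[0], [1], [2, 3], [4]]

# Candidate subsets are combinations of groups, so the subset size
# (max_features) is bounded by len(feature_groups) == 4, not by the
# five underlying columns.
n_candidates = 0
for r in range(1, len(feature_groups) + 1):
    for combo in combinations(feature_groups, r):
        candidate = sorted(f for group in combo for f in group)
        n_candidates += 1
print(n_candidates)  # 15 candidate subsets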
I like the suggestion! Slightly more concise version for the first:
Cool! Thanks for revising the note. I will add them to the docstring.
Hello @NimaSarajpoor! Thanks for updating this PR.
Line 327:89: E501 line too long (94 > 88 characters)
Line 361:89: E501 line too long (93 > 88 characters)
Line 390:89: E501 line too long (89 > 88 characters)
@rasbt UPDATE: I added support for feature_groups.

Notes:

- I have not worked on fixed_features (PR #651) yet. Should I do it here in this branch? Or, should I work on it in its own dedicated branch?
- There is currently no input validation for feature_groups. I think I need to check if the input is valid, and provide an error message otherwise.

Thanks so much for the update.
1. If it is not too much hassle, maybe add the fixed_feature also here. We could have a separate PR, but that would maybe add additional work for you since some of the steps are basically repeated.
2. I can take care of the documentation and the Changelog. I think these are usually the most "boring" things, and you are already doing so much.
For the check, maybe the following ones are good:
a) a check that the feature groups contain all feature indices
if set(_merge_lists(feature_groups)) != set(range(X.shape[1])):
    raise ValueError("`feature_groups` must contain all features within `range(X.shape[1])`")
b) check that feature groups are not empty:
for fg in feature_groups:
    if not fg:
        raise ValueError("Feature group can't be empty")
All tests are passing :)
Wohooo!
Ohh, 3 a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.
Option 1 (ideal):
I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].
Option 2 (ok):
We could also keep the feature group selection integer-based, e.g., [[1, 2], [3], [4]], and select DataFrame columns by index for now. We can then add string support later on.
Option 3:
For now, we can just add a check and say that feature_groups are currently not supported for pandas DataFrames. We can then add Option 1 or 2 in the future in another PR.
What do you think, is option 1 feasible at this point or do you prefer Option 2/3 and to revisit this later?
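For Option 1, a rough sketch of how column-name groups could be mapped to positional indices before running the selector (a hypothetical helper, not the final implementation):

import pandas as pd

def _groups_by_name_to_indices(feature_groups, df):
    # Translate each column name into its positional index in the DataFrame.
    columns = list(df.columns)
    return [[columns.index(name) for name in group] for group in feature_groups]

df = pd.DataFrame(columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
groups = [["Sepal length", "Petal length"], ["Sepal width"], ["Petal width"]]
print(_groups_by_name_to_indices(groups, df))  # [[0, 2], [1], [3]]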
maybe add the fixed_feature also here
Sure! Will do so.
2. I can take care of the documentation and the Changelog. I think these are the most "boring" things usually, and you are already doing so much
Thank you! :)
For the check, maybe the following ones are good:
a) a check that the feature groups contain all feature indices
Just for the record, I think we "can" allow users to provide not all but just some of the features in the parameter features_group. For instance, I THINK that even if the parameter value is [[0], [1, 2], [3]], the exhaustive feature selection will be performed no matter how many features the data actually has. Having said that, if a user wants to do that, they should simply pass the chunk of their data that contains only those features. So, I think error message (a) is good, as it prevents users from making accidental mistakes.
b) check that feature groups are not empty:
for fg in feature_groups:
    if not fg:
        raise ValueError("Feature group can't be empty")
Cool
Ohh, 3 a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.

Option 1 (ideal):
I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].
This is exactly what I thought after pushing the last set of commits! I am planning to work on this. And then, I will take care of the fixed_feature :)
@rasbt UPDATE

- Added fixed_features, and handled the case where both features_group and fixed_features are provided.
- Added tests for fixed_features in the module, AND two tests for fixed_features_and_features_group.
- Added support for column names in features_group and fixed_features.

What needs to be done:
- Add tests for the raise ValueError("message") calls in the code (because I can see there are some tests in the unit test file that check whether the errors are correctly raised, I assume I should add some tests for that too; see the small sketch after this comment. What do you think?)

Btw, I think we also need to support string values in the following case: the data X is not a DataFrame BUT custom_feature_names is provided by the user (see the input parameter custom_feature_names in the function fit). In that case, the user may want to provide string values for features_group and/or fixed_features.
Is it too much? I think we can revise the current code and maybe add one or two lines to support this. However, I assume we do not want to support EVERYTHING (right?) What do you think?
(On second thought: or, we can deprecate the support for custom_feature_names. If DataFrame df is supported, the user can create a data frame, use the custom feature names as column names, and then pass the df to the program.)
Please feel free to let me know whenever you feel the code is not clean/readable, so I can improve it by changing the code design, improving variable names, or providing better comments.
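Regarding the error-raising tests mentioned above, a minimal pytest-style sketch (assuming feature_groups is exposed as a constructor parameter, as in this PR, and that validation happens during fit; the exact error message is an assumption):

import numpy as np
import pytest
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

def test_empty_feature_group_raises():
    # An empty group should be rejected with a ValueError.
    X = np.random.rand(30, 5)
    y = np.random.randint(0, 2, 30)
    efs = EFS(KNeighborsClassifier(n_neighbors=3),
              min_features=1, max_features=2,
              feature_groups=[[0], [], [1, 2], [3], [4]])
    with pytest.raises(ValueError):
        efs.fit(X, y)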
@rasbt I think everything should be good now. Something in my code was bothering me and I fixed it (cleaned it). So, I am going to stop pushing now :)
Thanks for all the updates! On the one hand, I think the custom_feature_names parameter is very useful. However, it's from before we added pandas DataFrame support. I would say that it has become kind of redundant, and we can remove it for the sake of simplicity. Over time, things in the mlxtend library are becoming more and more complicated, so I guess it's fair that we remove one parameter in favor of adding the new feature_groups parameter.
To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review. I can add a documentation example to illustrate how users can achieve the same thing with pandas:
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.datasets import load_iris
import pandas as pd
iris_data = load_iris()
X, y = iris_data.data, iris_data.target
df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
knn = KNeighborsClassifier(n_neighbors=3)
efs1 = EFS(knn,
min_features=1,
max_features=4,
scoring='accuracy',
print_progress=True,
cv=5)
efs1 = efs1.fit(df_X, y)
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 15/15
Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('Sepal length', 'Petal length', 'Petal width')
To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop the custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review
Sure. Removing the feature makes sense. Thanks for your effort.
I can add a documentation example to illustrate how users can achieve the same thing with pandas:
Sounds good :)
Alright, just removed it to simplify the code. I may have to call it a day here, but I will be back with the code reviews in the next couple of days!
@rasbt Thanks for the changes! I am going to go through my code. If I notice anything that should be revised, I will review it and then wait to see your opinion.
Also, if you have an idea about what I should do as the next step in this PR, please let me know.
Thanks for the comments! I would say other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code
Thanks for the comments! I would say other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code
Cool! I will take care of the comments and push the changes.
@rasbt I took care of the issues. I think things should be good now :)
Wow thanks so much, looks great now and will merge!
@rasbt Thanks a lot for your guidance throughout this journey! I enjoyed working on this! So, thank you!
Likewise, thanks so much for the high-quality contribution @NimaSarajpoor. This is well appreciated! Actually, I may make a small release in the following days if time permits!
Description
This PR addresses issue #956. The idea is to allow the user to specify features that should be considered together throughout the feature selection process.
- Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
- Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
- Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
- Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all current unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
- Checked for style issues by running flake8 ./mlxtend
What I have done so far:

- Added features_group to mlxtend/feature_selection/exhaustive_feature_selector.py
- Made sure the current tests in feature_selection pass

What needs to be done:

- Add the same support to sequential_feature_selector.py
Please let me know what you think
Side note: I added some comments wherever I felt something should be noted. We should remove these comments later.