rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Fixes #956 Add `features_group` parameter to ExhaustiveFeatureSelector [WIP] #957

Closed NimaSarajpoor closed 1 year ago

NimaSarajpoor commented 2 years ago

Description

This PR addresses issue #956. The idea is to allow the user to specify groups of features that are considered together throughout the feature selection process.


What I have done so far:

What needs to be done:

Please let me know what you think


Side note: I added inline comments wherever I felt something needed attention. We should remove those comments before merging.

NimaSarajpoor commented 2 years ago

Recently, I found another PR ( #651 ) for adding fixed_features to ExhaustiveFeatureSelector. So, I think we are going to have a slight problem here.

Let's say our features are [1, 2, 3, 4, 5] and the user wants to:

So...how should we deal with this? We can:

rasbt commented 1 year ago

Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!

EDIT: We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.

rasbt commented 1 year ago

Recently, I found another PR ( https://github.com/rasbt/mlxtend/pull/651 ) for adding fixed_features to ExhaustiveFeatureSelector. So, I think we are going to have a slight problem here. Let's say our features are [1, 2, 3, 4, 5] and the user wants to:

- consider feature_groups as [[1], [2], [3], [4, 5]]
- consider fixed features: [1, 4]

So... how should we deal with this? We can:

- throw a warning and consider [1, 4, 5] as fixed features? Or,
- throw an error and let the user know that there is a conflict between these two parameters? Or,
- ???

Oh good call. To be honest, I think it might be best to check for conflicts and raise an error so the user can fix it. It's maybe better than having a silent behavior that is not clear to the user.

So in the case above, maybe an error message would make sense like

raise ValueError("Feature 4 was specified as a fixed feature but feature 4 has other features in its feature group.")

And in the docstring, we can maybe say something like

If both fixed_features and feature_groups are used, all members of a fixed feature's group need to be specified as fixed features as well.

What do you think?
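The conflict check discussed above could be sketched roughly as follows. This is only an illustration of the idea, not mlxtend's actual implementation; the helper name `check_fixed_features_conflict` is hypothetical:

```python
def check_fixed_features_conflict(fixed_features, feature_groups):
    """Raise if a feature group mixes fixed and non-fixed members."""
    fixed = set(fixed_features)
    for group in feature_groups:
        group_set = set(group)
        if group_set & fixed and not group_set <= fixed:
            raise ValueError(
                f"Feature(s) {sorted(group_set & fixed)} were specified as fixed, "
                f"but their feature group also contains {sorted(group_set - fixed)}. "
                "All members of a fixed feature's group must be fixed as well."
            )

# The conflicting example from the discussion: feature 4 is fixed,
# but its group [4, 5] also contains the non-fixed feature 5.
try:
    check_fixed_features_conflict([1, 4], [[1], [2], [3], [4, 5]])
except ValueError as err:
    print(err)
```

Fixing the conflict by also fixing feature 5 (i.e. `fixed_features=[1, 4, 5]`) would pass the check silently.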

NimaSarajpoor commented 1 year ago

I think it might be best to check for conflicts and raise an error so the user can fix it. It's maybe better than having a silent behavior that is not clear to the user.

You are right... It would be safer. We are putting just a little extra load on users' shoulders to take care of such a conflict. However, they would then have a better understanding of what's going on :)

NimaSarajpoor commented 1 year ago

Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!

Thanks for creating this opportunity :) I enjoy doing this :)

We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.

Sure.

I guess the next step would be unit tests. Would you like any help with it?

I will probably need some help 😄 For now, I am going to take a look at the existing test function(s) and see how things are. I will try to do it myself. If I run into any issues, I will let you know.

NimaSarajpoor commented 1 year ago

@rasbt Also, on weekends I do not have access to my main PC, which contains my local repo and in-progress work. I use another PC during the weekend and thus just review my code. So, you may sometimes notice that I review my own code and flag some issues (so I do not forget them later when I push the next set of commits).

NimaSarajpoor commented 1 year ago

We also need to add two notes to the docstring:

rasbt commented 1 year ago

I like the suggestion! Slightly more concise version for the first:

If parameter feature_groups is not None, the number of features is equal to the number of feature groups (len(feature_groups)). For example, if feature_groups=[[0], [1], [2,3], [4]] the max_features value cannot exceed 4.
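That docstring rule could be enforced with a small validation step. A minimal sketch (the helper name `check_max_features` is hypothetical, not mlxtend's actual API):

```python
def check_max_features(max_features, feature_groups, n_features):
    """With feature_groups, each group counts as one selectable unit,
    so max_features is bounded by len(feature_groups) instead of the
    raw number of columns."""
    upper = n_features if feature_groups is None else len(feature_groups)
    if max_features > upper:
        raise ValueError(
            f"max_features={max_features} exceeds the number of "
            f"selectable units ({upper})."
        )

# The example from the note: 4 groups over 5 columns, so max_features <= 4.
check_max_features(4, [[0], [1], [2, 3], [4]], n_features=5)
```

With the same groups, `max_features=5` would raise, even though the data has 5 columns.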

NimaSarajpoor commented 1 year ago

I like the suggestion! Slightly more concise version for the first:

Cool! Thanks for revising the note. I will add them to the docstring.

pep8speaks commented 1 year ago

Hello @NimaSarajpoor! Thanks for updating this PR.

Line 327:89: E501 line too long (94 > 88 characters)
Line 361:89: E501 line too long (93 > 88 characters)
Line 390:89: E501 line too long (89 > 88 characters)

Comment last updated at 2022-08-09 01:29:49 UTC

NimaSarajpoor commented 1 year ago

@rasbt UPDATE


Notes:

rasbt commented 1 year ago

Thanks so much for the update.

  1. If it is not too much hassle, maybe add the fixed_feature also here. We could have a separate PR, but that is maybe adding additional work for you since some of the steps are basically repeated.

  2. I can take care of the documentation and the Changelog. I think these are the most "boring" things usually, and you are already doing so much

For the check, maybe the following ones are good:

a) a check that the feature groups contain all feature indices

if set(_merge_lists(feature_groups)) != set(range(X.shape[1])):
    raise ValueError(
        "`feature_groups` must contain all features within `range(X.shape[1])`"
    )

b) check that feature groups are not empty:

for fg in feature_groups:
    if not fg:
        raise ValueError("Feature group can't be empty")

All tests are passing :)

Wohooo!

Ohh, 3 a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.

Option 1 (ideal):

I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].

Option 2 (ok):

We could also let the feature group selection integer based, e.g., [[1, 2], [3], [4]] and select DataFrame columns by index for now. We can then add string support later on.

Option 3:

For now, we can just add a check and say that feature_groups are currently not supported for pandas DataFrames. We can then add Option 1 or 2 in the future in another PR.

What do you think, is option 1 feasible at this point or do you prefer Option 2/3 and to revisit this later?
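Option 1 essentially comes down to mapping column names to positional indices before the search runs. A minimal sketch of that translation step (the helper name `resolve_feature_groups` is hypothetical, not mlxtend's actual implementation):

```python
def resolve_feature_groups(feature_groups, columns):
    """Translate string feature names to positional indices, given the
    ordered column names (e.g. df.columns); integer entries pass through."""
    name_to_idx = {name: i for i, name in enumerate(columns)}
    return [
        [name_to_idx[f] if isinstance(f, str) else f for f in group]
        for group in feature_groups
    ]

cols = ["Sepal length", "Sepal width", "Petal length", "Petal width"]
groups = [["Sepal length", "Petal length"], ["Sepal width"], ["Petal width"]]
print(resolve_feature_groups(groups, cols))  # [[0, 2], [1], [3]]
```

After this normalization, the rest of the selector can work purely with integer indices, which also keeps Option 2 (integer groups) working unchanged.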

NimaSarajpoor commented 1 year ago

maybe add the fixed_feature also here

Sure! Will do so.

2. I can take care of the documentation and the Changelog. I think these are the most "boring" things usually, and you are already doing so much

Thank you! :)

For the check, maybe the following ones are good:

a) a check that the feature groups contain all feature indices

Just for the record, I think we "can" allow users to provide not all but just some of the features in the feature_groups parameter. For instance, I THINK that even if the parameter value is [[0], [1, 2], [3]], the exhaustive feature selection would still run no matter how many features the data actually has. Having said that, if a user wants to do that, they should simply pass the chunk of their data that contains only those features. So, I think error message (a) is good as it prevents users from making accidental mistakes.
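As a side note, the grouped exhaustive search itself amounts to enumerating combinations of groups rather than of individual columns, then flattening each combination back to column indices. A minimal illustration with itertools (not mlxtend's actual implementation; the function name is hypothetical):

```python
from itertools import chain, combinations

def candidate_subsets(feature_groups, min_features, max_features):
    """Yield flattened column-index tuples, treating each feature group
    as a single selectable unit during the exhaustive search."""
    n_groups = len(feature_groups)
    for r in range(min_features, min(max_features, n_groups) + 1):
        for combo in combinations(range(n_groups), r):
            yield tuple(chain.from_iterable(feature_groups[i] for i in combo))

# With groups [[0], [1, 2], [3]], features 1 and 2 always travel together.
for subset in candidate_subsets([[0], [1, 2], [3]], 1, 2):
    print(subset)
```

Note that min_features and max_features count groups here, which matches the docstring rule discussed earlier in the thread.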

b) check that feature groups are not empty:

for fg in feature_groups:
    if not fg:
        raise ValueError("Feature group can't be empty")

Cool

Ohh, 3 a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.

Option 1 (ideal):

I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].

This is exactly what I thought after pushing the last set of commits! I am planning to work on this. And, then, I will take care of the fixed_feature :)

NimaSarajpoor commented 1 year ago

@rasbt UPDATE

What needs to be done:


Btw, I think we also need to support string values in the following case: data X is not a DataFrame BUT custom_feature_names is provided by the user (see the input parameter custom_feature_names in the fit method).

In that case, the user may want to provide string values for features_group and/or fixed_features.

Is it too much? I think we can revise the current code and maybe add one or two lines to support this. However, I assume we do not want to support EVERYTHING (right?). What do you think? (On second thought: or, we could deprecate the support for custom_feature_names. Since DataFrames are supported, users can create a DataFrame with their custom feature names as columns and pass that instead.)


Please feel free to let me know whenever you feel the code is not clean or readable, so I can improve it by changing the design, renaming variables, or providing better comments.

NimaSarajpoor commented 1 year ago

@rasbt I think everything should be good now. Something in my code was bothering me and I fixed it (cleaned it). So, I am going to stop pushing now :)

rasbt commented 1 year ago

Thanks for all the updates! On the one hand, I think the custom_feature_names parameter is very useful. However, it's from before we added pandas DataFrame support. I would say that it has become kind of redundant, and we can remove it for the sake of simplicity. Over time, things in the mlxtend library are becoming more and more complicated, so I guess it's fair that we remove one parameter in favor of adding the new feature_groups parameter 😉.

To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop the custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review. I can add a documentation example to illustrate how users can achieve the same thing with pandas:

from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.datasets import load_iris
import pandas as pd

iris_data = load_iris()
X, y = iris_data.data, iris_data.target

df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])

knn = KNeighborsClassifier(n_neighbors=3)

efs1 = EFS(knn, 
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=5)

efs1 = efs1.fit(df_X, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)

Features: 15/15
Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('Sepal length', 'Petal length', 'Petal width')

NimaSarajpoor commented 1 year ago

To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop the custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review

Sure. Removing the feature makes sense. Thanks for your effort.

I can add a documentation example to illustrate how users can achieve the same thing with pandas:

Sounds good :)

rasbt commented 1 year ago

Alright, just removed it to simplify the code. I may have to call it a day here, but I will be back with the code reviews in the next couple of days!

NimaSarajpoor commented 1 year ago

@rasbt Thanks for the changes! I am going to go through my code. If I notice anything that should be revised, I will leave review comments and wait for your opinion on them.

Also, if you have an idea about what I should do as the next step in this PR, please let me know.

rasbt commented 1 year ago

Thanks for the comments! I would say other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code

NimaSarajpoor commented 1 year ago

Thanks for the comments! I would say other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code

Cool! I will take care of the comments and push the changes.

NimaSarajpoor commented 1 year ago

@rasbt I took care of the issues. I think things should be good now :)

rasbt commented 1 year ago

Wow thanks so much, looks great now and will merge!

NimaSarajpoor commented 1 year ago

@rasbt Thanks a lot for your guidance throughout this journey! I enjoyed working on this! So, thank you!

rasbt commented 1 year ago

Likewise, thanks so much for the high-quality contribution @NimaSarajpoor. This is well appreciated! Actually, I may make a small release in the following days if time permits!