rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Fixes #956 Add `features_group` parameter to ExhaustiveFeatureSelector [WIP] #957

Closed NimaSarajpoor closed 1 year ago

NimaSarajpoor commented 2 years ago

Description

This PR addresses issue #956. The idea is to allow the user to specify groups of features that are considered together throughout the feature selection process.


What I have done so far:

What needs to be done:

Please let me know what you think


Side note: I added inline comments wherever I felt something needed attention. We should remove those comments before merging.

NimaSarajpoor commented 2 years ago

Recently, I found another PR ( #651 ) for adding fixed_features to ExhaustiveFeatureSelector. So, I think we are going to have a slight problem here.

Let's say our features are [1, 2, 3, 4, 5] and the user wants to:

So...how should we deal with this? We can:

rasbt commented 1 year ago

Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!

EDIT: We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.

rasbt commented 1 year ago

Recently, I found another PR ( https://github.com/rasbt/mlxtend/pull/651 ) for adding fixed_features to ExhaustiveFeatureSelector. So, I think we are going to have a slight problem here. Let's say our features are [1, 2, 3, 4, 5] and the user wants to:

- consider feature_groups as [[1], [2], [3], [4, 5]]
- consider fixed features: [1, 4]

So... how should we deal with this? We can:

- throw a warning and consider [1, 4, 5] as fixed features? Or,
- throw an error and let the user know that there is a conflict between these two parameters? Or,
- ???

Oh good call. To be honest, I think it might be best to check for conflicts and raise an error so the user can fix it. It's maybe better than having a silent behavior that is not clear to the user.

So in the case above, maybe an error message would make sense like

raise ValueError("Feature 4 was specified as a fixed feature but feature 4 has other features in its feature group.")

And in the docstring, we can maybe say something like

If both fixed_features and feature_groups are used, all members of a fixed feature's group need to be specified as fixed features as well.

What do you think?
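The conflict check discussed above could be sketched roughly as follows. This is only an illustration of the idea, not mlxtend's actual implementation; the helper name `check_fixed_features_conflict` is hypothetical:

```python
def check_fixed_features_conflict(fixed_features, feature_groups):
    """Raise if a feature group mixes fixed and non-fixed members."""
    fixed = set(fixed_features)
    for group in feature_groups:
        group_set = set(group)
        if group_set & fixed and not group_set <= fixed:
            raise ValueError(
                f"Feature(s) {sorted(group_set & fixed)} were specified as fixed, "
                f"but their feature group also contains {sorted(group_set - fixed)}. "
                "All members of a fixed feature's group must be fixed as well."
            )

# The conflicting example from the discussion: feature 4 is fixed,
# but its group [4, 5] also contains the non-fixed feature 5.
try:
    check_fixed_features_conflict([1, 4], [[1], [2], [3], [4, 5]])
except ValueError as err:
    print(err)
```

Fixing the conflict by also fixing feature 5 (i.e. `fixed_features=[1, 4, 5]`) would pass the check silently.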

NimaSarajpoor commented 1 year ago

I think it might be best to check for conflicts and raise an error so the user can fix it. It's maybe better than having a silent behavior that is not clear to the user.

You are right... It would be safer. We are putting just a little extra load on users' shoulders to take care of such a conflict. However, they would then have a better understanding of what's going on :)

NimaSarajpoor commented 1 year ago

Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!

Thanks for creating this opportunity :) I enjoy doing this :)

We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.

Sure.

I guess the next step would be unit tests. Would you like any help with it?

I will probably need some help 😄 For now, I am going to take a look at the existing test function(s) and see how things are. I will try to do it myself. If I run into any issues, I will let you know.

NimaSarajpoor commented 1 year ago

@rasbt Also, on weekends I do not have access to my main PC, which contains my local repo and in-progress work. I use another PC during the weekend and thus just review my code. So, you may sometimes notice that I review my own code and flag some issues (so I do not forget them later when I push the next set of commits).

NimaSarajpoor commented 1 year ago

We also need to add two notes to the docstring:

rasbt commented 1 year ago

I like the suggestion! Slightly more concise version for the first:

If parameter feature_groups is not None, the number of features is equal to the number of feature groups (len(feature_groups)). For example, if feature_groups=[[0], [1], [2,3], [4]] the max_features value cannot exceed 4.
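That docstring rule could be enforced with a small validation step. A minimal sketch (the helper name `check_max_features` is hypothetical, not mlxtend's actual API):

```python
def check_max_features(max_features, feature_groups, n_features):
    """With feature_groups, each group counts as one selectable unit,
    so max_features is bounded by len(feature_groups) instead of the
    raw number of columns."""
    upper = n_features if feature_groups is None else len(feature_groups)
    if max_features > upper:
        raise ValueError(
            f"max_features={max_features} exceeds the number of "
            f"selectable units ({upper})."
        )

# The example from the note: 4 groups over 5 columns, so max_features <= 4.
check_max_features(4, [[0], [1], [2, 3], [4]], n_features=5)
```

With the same groups, `max_features=5` would raise, even though the data has 5 columns.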

NimaSarajpoor commented 1 year ago

I like the suggestion! Slightly more concise version for the first:

Cool! Thanks for revising the note. I will add them to the docstring.

pep8speaks commented 1 year ago

Hello @NimaSarajpoor! Thanks for updating this PR.

Line 327:89: E501 line too long (94 > 88 characters)
Line 361:89: E501 line too long (93 > 88 characters)
Line 390:89: E501 line too long (89 > 88 characters)

Comment last updated at 2022-08-09 01:29:49 UTC

NimaSarajpoor commented 1 year ago

@rasbt UPDATE


Notes:

rasbt commented 1 year ago

Thanks so much for the update.

  1. If it is not too much hassle, maybe add the fixed_feature also here. We could have a separate PR, but that is maybe adding additional work for you since some of the steps are basically repeated.

  2. I can take care of the documentation and the Changelog. I think these are the most "boring" things usually, and you are already doing so much

For the check, maybe the following ones are good:

a) a check that the feature groups contain all feature indices

if set(_merge_lists(feature_groups)) != set(range(X.shape[1])):
    raise ValueError(
        "`feature_groups` must contain all features within `range(X.shape[1])`"
    )

b) check that feature groups are not empty:

for fg in feature_groups:
    if not fg:
        raise ValueError("Feature group can't be empty")

All tests are passing :)

Wohooo!

Ohh, 3 a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.

Option 1 (ideal):

I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].

Option 2 (ok):

We could also let the feature group selection integer based, e.g., [[1, 2], [3], [4]] and select DataFrame columns by index for now. We can then add string support later on.

Option 3:

For now, we can just add a check and say that feature_groups are currently not supported for pandas DataFrames. We can then add Option 1 or 2 in the future in another PR.

What do you think, is option 1 feasible at this point or do you prefer Option 2/3 and to revisit this later?
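Option 1 essentially comes down to mapping column names to positional indices before the search runs. A minimal sketch of that translation step (the helper name `resolve_feature_groups` is hypothetical, not mlxtend's actual implementation):

```python
def resolve_feature_groups(feature_groups, columns):
    """Translate string feature names to positional indices, given the
    ordered column names (e.g. df.columns); integer entries pass through."""
    name_to_idx = {name: i for i, name in enumerate(columns)}
    return [
        [name_to_idx[f] if isinstance(f, str) else f for f in group]
        for group in feature_groups
    ]

cols = ["Sepal length", "Sepal width", "Petal length", "Petal width"]
groups = [["Sepal length", "Petal length"], ["Sepal width"], ["Petal width"]]
print(resolve_feature_groups(groups, cols))  # [[0, 2], [1], [3]]
```

After this normalization, the rest of the selector can work purely with integer indices, which also keeps Option 2 (integer groups) working unchanged.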

NimaSarajpoor commented 1 year ago

maybe add the fixed_feature also here

Sure! Will do so.

2. I can take care of the documentation and the Changelog. I think these are the most "boring" things usually, and you are already doing so much

Thank you! :)

For the check, maybe the following ones are good:

a) a check that the feature groups contain all feature indices

Just for the record, I think we "can" allow users to provide not all but just some of the features in the feature_groups parameter. For instance, I THINK that even if the parameter value is [[0], [1, 2], [3]], the exhaustive feature selection would still run no matter how many features the data actually has. Having said that, if a user wants to do that, they should simply pass the chunk of their data that contains only those features. So, I think error message (a) is good as it prevents users from making accidental mistakes.
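As a side note, the grouped exhaustive search itself amounts to enumerating combinations of groups rather than of individual columns, then flattening each combination back to column indices. A minimal illustration with itertools (not mlxtend's actual implementation; the function name is hypothetical):

```python
from itertools import chain, combinations

def candidate_subsets(feature_groups, min_features, max_features):
    """Yield flattened column-index tuples, treating each feature group
    as a single selectable unit during the exhaustive search."""
    n_groups = len(feature_groups)
    for r in range(min_features, min(max_features, n_groups) + 1):
        for combo in combinations(range(n_groups), r):
            yield tuple(chain.from_iterable(feature_groups[i] for i in combo))

# With groups [[0], [1, 2], [3]], features 1 and 2 always travel together.
for subset in candidate_subsets([[0], [1, 2], [3]], 1, 2):
    print(subset)
```

Note that min_features and max_features count groups here, which matches the docstring rule discussed earlier in the thread.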

b) check that feature groups are not empty:

for fg in feature_groups:
    if not fg:
        raise ValueError("Feature group can't be empty")

Cool

Ohh, 3 a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.

Option 1 (ideal):

I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].

This is exactly what I thought after pushing the last set of commits! I am planning to work on this. And, then, I will take care of the fixed_feature :)

NimaSarajpoor commented 1 year ago

@rasbt UPDATE

What needs to be done:


Btw, I think we also need to support string values in the following case: data X is not a DataFrame BUT custom_feature_names is provided by the user (see the input parameter custom_feature_names in the fit method).

In that case, the user may want to provide string values for features_group and/or fixed_features.

Is it too much? I think we can revise the current code and maybe add one or two lines to support this. However, I assume we do not want to support EVERYTHING (right?). What do you think? (On second thought: or, we could deprecate the support for custom_feature_names. Since DataFrames are supported, users can create a DataFrame with their custom feature names as columns and pass that instead.)


Please feel free to let me know whenever you feel the code is not clean or readable, so I can improve it by changing the design, renaming variables, or providing better comments.

NimaSarajpoor commented 1 year ago

@rasbt I think everything should be good now. Something in my code was bothering me and I fixed it (cleaned it). So, I am going to stop pushing now :)

rasbt commented 1 year ago

Thanks for all the updates! On the one hand, I think the custom_feature_names parameter is very useful. However, it's from before we added pandas DataFrame support. I would say that it has become kind of redundant, and we can remove it for the sake of simplicity. Over time, things in the mlxtend library are becoming more and more complicated, so I guess it's fair that we remove one parameter in favor of adding the new feature_groups parameter 😉.

To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop the custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review. I can add a documentation example to illustrate how users can achieve the same thing with pandas:

from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.datasets import load_iris
import pandas as pd

iris_data = load_iris()
X, y = iris_data.data, iris_data.target

df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])

knn = KNeighborsClassifier(n_neighbors=3)

efs1 = EFS(knn, 
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=5)

efs1 = efs1.fit(df_X, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)

Features: 15/15
Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('Sepal length', 'Petal length', 'Petal width')

NimaSarajpoor commented 1 year ago

To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop the custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review

Sure. Removing the feature makes sense. Thanks for your effort.

I can add a documentation example to illustrate how users can achieve the same thing with pandas:

Sounds good :)

rasbt commented 1 year ago

Alright, just removed it to simplify the code. I may have to call it a day here, but I will be back with the code reviews in the next couple of days!

NimaSarajpoor commented 1 year ago

@rasbt Thanks for the changes! I am going to go through my code. If I notice anything that should be revised, I will leave review comments and wait for your opinion on them.

Also, if you have an idea about what I should do as the next step in this PR, please let me know.

rasbt commented 1 year ago

Thanks for the comments! I would say other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code

NimaSarajpoor commented 1 year ago

Thanks for the comments! I would say other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code

Cool! I will take care of the comments and push the changes.

NimaSarajpoor commented 1 year ago

@rasbt I took care of the issues. I think things should be good now :)

rasbt commented 1 year ago

Wow thanks so much, looks great now and will merge!

NimaSarajpoor commented 1 year ago

@rasbt Thanks a lot for your guidance throughout this journey! I enjoyed working on this! So, thank you!

rasbt commented 1 year ago

Likewise, thanks so much for the high-quality contribution @NimaSarajpoor. This is well appreciated! Actually, I may make a small release in the following days if time permits!