rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Fixes #502: Add `feature_groups` parameter to SequentialFeatureSelector #965

Closed NimaSarajpoor closed 1 year ago

NimaSarajpoor commented 1 year ago

This PR adds support for feature_groups in SequentialFeatureSelector, as discussed in #502.

Before developing the new feature, I decided to clean up the code first, as I noticed it has some nested-if blocks. The module sequential_feature_selector.py now has about ~~30~~ 70 fewer lines, and I think it might be a little more readable.

I will add the parameter feature_groups in the near future.

NimaSarajpoor commented 1 year ago

@rasbt Okay... I think we are now in a good position to add the parameter feature_groups. In the meantime, if you have any suggestions, please let me know.

NimaSarajpoor commented 1 year ago

I opened the issue #966 to discuss a few things regarding the current implementation. I think it would be easier to resolve those first if they are reasonable.

NimaSarajpoor commented 1 year ago

Quick recap on the major changes:

NimaSarajpoor commented 1 year ago

@rasbt So, I have changed some things in the module sequential_feature_selector.py. I am going to add the new param feature_groups to this version. However, it would be great if you could let me know whether you are okay with the changes.

So, I will wait for your green light before moving forward. If you are not okay with the changes I made, I can then try to add the new param to the existing version on the master branch. I know you are busy, as we are close to the beginning of a new semester and you have other things on your mind, so there is no rush :)


Update: I had a question about KeyboardInterrupt, but I figured it out, I guess. I made some minor changes to the code.

NimaSarajpoor commented 1 year ago

@rasbt So, it seems we are on the same page regarding the changes. I have already fixed the ones you were okay with (as discussed in #966), so there is no need to push any new commits regarding #966.

I can move forward and add the new param, or I can wait until you have checked the code in this branch first.

codecov[bot] commented 1 year ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@2ed7ecf). Click here to learn what that means. The diff coverage is n/a.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             master     #965    +/-   ##
==========================================
  Coverage          ?   77.45%
==========================================
  Files             ?      198
  Lines             ?    11165
  Branches          ?     1498
==========================================
  Hits              ?     8648
  Misses            ?     2304
  Partials          ?      213
```
rasbt commented 1 year ago

Thanks for the refactoring, that's amazing! Really impressive. The code base is pretty unwieldy, and I am impressed that you were able to refactor it that much without breaking anything. Hats off and many thanks for this!

Btw just added some coverage back. The coverage of the SFS itself isn't too bad actually, around 97%.

Anyway, based on going through this, it all looks great and I would definitely keep these changes. So it would be good to go ahead with feature_groups, I guess.

NimaSarajpoor commented 1 year ago

@rasbt I added support for feature_groups and added several unit tests. I used the existing test functions as a basis for new test functions that cover feature_groups.


In short, the approach is as follows: we do forward (or backward) selection, with or without floating, by traversing groups of features (and then merge the groups before performing cross-validation). So, throughout the .fit process, the keys of the dictionary self.subsets_ are no longer the number of features but the number of groups (so we can track which groups have been selected/excluded so far). Also, self.subsets_[k]["feature_idx"] now contains the indices of the groups that are merged as selected features. Similarly, self.k_feature_idx_ contains the indices of the merged groups (not the indices of the features).

At the end of the .fit process (just before returning self), we replace these values with the actual indices of the selected features. However, I do not change the keys of self.subsets_, as I thought it might be better if the keys in subsets_ still reflect the number of selected groups and not the number of features.
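To illustrate the idea (this is only a rough sketch; the helper name and data layout below are hypothetical, not the actual implementation):

# Rough sketch: selection iterates over *group* indices, and a tuple of
# group indices is expanded into plain feature indices right before
# cross-validation.
feature_groups = [[0, 1], [2], [3, 4]]  # e.g., 5 features in 3 groups

def merge_groups(selected_groups, feature_groups):
    """Expand selected group indices into the underlying feature indices."""
    feature_idx = []
    for g in selected_groups:
        feature_idx.extend(feature_groups[g])
    return tuple(sorted(feature_idx))

# After two rounds of (group-wise) forward selection:
selected_groups = (0, 2)
print(merge_groups(selected_groups, feature_groups))  # -> (0, 1, 3, 4)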


The next steps are: (Try to be consistent with mlxtend/feature_selection/exhaustive_feature_selector.py)

rasbt commented 1 year ago

Thanks for adding all the code for this new feature! I have to head out soon, and I am not sure if I can review it all today, but I definitely want to pick it back up tomorrow morning. In the meantime, let me make some comments so I don't forget :)

At the end of the .fit process (just before returning self), we replace these values with the actual indices of the selected features.

This is a good workaround. We would have to make sure, though, that this also works with KeyboardInterrupt. Otherwise, how do you feel about introducing another variable (a dictionary?) just for tracking?

as I thought it might be better that the keys in subsets still reflect the number of selected groups and not the number of features.

I think that would be fine. If someone doesn't use feature_groups, it would still be the number of features though, right?

NimaSarajpoor commented 1 year ago

I have to head out soon, and I am not sure if I can review it all today, but I definitely want to pick it back up tomorrow morning.

Cool!

We would have to make sure, though, that this also works with KeyboardInterrupt. Otherwise, how do you feel about introducing another variable (a dictionary?) just for tracking?

Thanks for bringing that to my attention! I think we should be able to do that as follows:

if self.interrupted_:
    # modify dictionary
    return self

I will try it, and if we realize that it is not possible, I will use a new variable. If you think using a separate variable is cleaner / more readable, I can definitely create a new variable and update the code accordingly.

If someone doesn't use feature_groups, it would still be the number of features though, right?

That is correct, because in that case each feature is treated as a single group. Hence, it would still be the number of features when feature_groups=None.
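A quick sketch of that default (assumed behavior, for illustration only):

# Assumed default when feature_groups=None: every feature is its own
# one-element group, so group indices and feature indices coincide.
n_features = 4
feature_groups = [[i] for i in range(n_features)]
print(feature_groups)  # [[0], [1], [2], [3]]
# -> "number of groups" == "number of features", so the keys of
#    self.subsets_ keep their usual meaning in the non-grouped case.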

pep8speaks commented 1 year ago

Hello @NimaSarajpoor! Thanks for updating this PR.

Line 285:13: W503 line break before binary operator
Line 291:13: W503 line break before binary operator
Line 297:13: W503 line break before binary operator
Line 302:89: E501 line too long (123 > 88 characters)
Line 357:89: E501 line too long (94 > 88 characters)
Line 392:89: E501 line too long (93 > 88 characters)
Line 575:29: W503 line break before binary operator
Line 656:17: W503 line break before binary operator
Line 657:17: W503 line break before binary operator
Line 664:89: E501 line too long (91 > 88 characters)
Line 681:21: W503 line break before binary operator
Line 682:21: W503 line break before binary operator

Line 7:1: F401 'numpy.nan' imported but unused
Line 12:1: F401 'sklearn.metrics.roc_auc_score' imported but unused

Line 1:1: F401 'copy.deepcopy' imported but unused

Comment last updated at 2022-09-12 03:48:20 UTC
rasbt commented 1 year ago

Thanks for bringing that to my attention! I think we should be able to do that as follows:

if self.interrupted_:
    # modify dictionary
    return self

Yeah, that's a good solution. Together with

That is correct. Because, in that case, each feature is treated as single group. Hence, it would be still the number of features when feature_groups=None.

That's probably ideal. This way, the behavior for non-feature-group uses stays the same. At the same time, for feature_groups we can treat the feature groups as "regular features" internally, like you already do, which is probably less cluttered than introducing a new variable.

NimaSarajpoor commented 1 year ago

That is correct, because in that case each feature is treated as a single group. Hence, it would still be the number of features when feature_groups=None.

That's probably ideal. This way, the behavior for non-feature-group uses stays the same. At the same time, for feature_groups we can treat the feature groups as "regular features" internally, like you already do, which is probably less cluttered than introducing a new variable.

Thanks for the feedback. My goal was to add the param without breaking anything, if possible. Thanks for your help. I will try to add some checks for the inputs as well as support for string feature names.


Side note: at the end, you may want to remove custom_feature_names (similar to what you did in ExhaustiveFeatureSelector).

rasbt commented 1 year ago

Just wanted to ask whether we should reuse the code from ExhaustiveFeatureSelector regarding string support in feature_groups. Like you said, the custom_feature_names can then be removed in a similar fashion. I can help with that if you want

NimaSarajpoor commented 1 year ago

Just wanted to ask whether we should reuse the code from ExhaustiveFeatureSelector regarding string support in feature_groups

I think that is a great idea. In fact, I was looking at both modules last night, and I think there are some opportunities for refactoring. Are you thinking of putting the common code in a utilities.py file or something like that?

NimaSarajpoor commented 1 year ago

I can help with that if you want

Sure. I will try to remove it...and if I realize that I cannot do it, I will let you know :)

rasbt commented 1 year ago

Are you thinking of putting the common code in a utilites.py file or something like that?

Oh yeah, that's a great idea I'd say. For now, I think putting it into a utilities.py inside mlxtend/feature_selection would be a good way to start. Some of the functionality (like the feature idx to name mapping utilities) could also be used more broadly later on; but I wouldn't worry about it for now. Internal stuff can always be moved later without breaking someone else's code :)
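For example, a shared helper along these lines could be a starting point (the name and signature are made up for illustration and are not the actual mlxtend API):

# mlxtend/feature_selection/utilities.py (hypothetical sketch)

def feature_idx_to_names(feature_idx, custom_names=None):
    """Map integer feature indices to string names.

    Falls back to the stringified indices when no column names are given,
    so both SFS and EFS could reuse the same logic.
    """
    if custom_names is None:
        return tuple(str(i) for i in feature_idx)
    return tuple(custom_names[i] for i in feature_idx)

# Example:
# feature_idx_to_names((0, 2), ["sepal length", "sepal width", "petal length"])
# -> ('sepal length', 'petal length')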

NimaSarajpoor commented 1 year ago

@rasbt FYI: I tried to solve the error raised by the new test functions. I then realized it might be a little tricky, as it is related to the current design. So, for now, I am bypassing them. I will take care of it after refactoring the two modules.

rasbt commented 1 year ago

Wow, it looks like you did a ton of work on this. Sorry that this turned out to be such a big ask. Btw it does look like it is working now though?

NimaSarajpoor commented 1 year ago

Btw it does look like it is working now though?

I detected the problem, but I decided to resolve it later, after refactoring/cleaning the code, as it will probably be easier then. I did:

# applied only to the two most recently added test functions
def test_new():
    return True  # temporary bypass
    # rest of the original test code (currently unreachable)

So, I am bypassing them, and I will bring them back in later.

NimaSarajpoor commented 1 year ago

@rasbt I will let you know when things are good :)

NimaSarajpoor commented 1 year ago

@rasbt Could you please let me know what you think about this?

What should the program return when KeyboardInterrupt is raised? Let's consider sequential_feature_selector.py for now. Let's assume the user is interested in forward selection and k_features is set to (3, 5). What would we want the user to see when the program is interrupted right after the first or second round of feature selection? Note that in such a case, the keys of the dictionary self.subsets_ are not in {3, 4, 5}. So I think it might be better to avoid post-processing in that case. That is why I did return self right after the exception block in the sequential case. Also, self.fitted should remain False in such a case (what do you think?)
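To make that concrete (the values below are purely illustrative):

# Purely illustrative: what self.subsets_ might hold if fit() is
# interrupted after the second round of forward selection.
subsets_ = {
    1: {"feature_idx": (2,), "avg_score": 0.71},
    2: {"feature_idx": (0, 2), "avg_score": 0.78},
}
# With k_features=(3, 5), no key in {3, 4, 5} exists yet, so reporting
# a "best" subset in that range would be misleading.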

Now, if we take a look at the exhaustive module, we can see that it does not return self after the exception block. If I do:

# in exhaustive_feature_selector.py

except KeyboardInterrupt:
    self.interrupted_ = True
    sys.stderr.write("\nSTOPPING EARLY DUE TO KEYBOARD INTERRUPT...")
    return self  # NEW

then I get the following error when I run pytest:

# in tests/test_exhaustive_feature_selector.py

# in function `test_check_pandas_dataframe_fit`

        sfs1._TESTING_INTERRUPT_MODE = True
        sfs1 = sfs1.fit(df, y)
>       assert efs1.best_idx_ == (0, 1), efs1.best_idx_
E       AssertionError: None
E       assert None == (0, 1)

This means we are testing efs.best_idx_ in such a case. However, I think we should just return self inside the exception block (or right after it) and avoid post-processing. We should not test best_idx_. I mean, we can test the last indices of features or the dictionary subsets_ to test _TESTING_INTERRUPT_MODE. We should not give the user self.best_idx_. If the user is interested in the best-so-far score or indices, they should be able to get it from the dictionary .subsets_. What do you think?
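For instance, the interrupt test could then assert on the partial results instead (a rough sketch that assumes the proposed early return; the setup and assertions are illustrative, not the existing test code):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

X, y = load_iris(return_X_y=True)
efs1 = EFS(KNeighborsClassifier(n_neighbors=4), min_features=2, max_features=3, cv=0)

# simulate the interrupt as the existing tests do
efs1._TESTING_INTERRUPT_MODE = True
efs1 = efs1.fit(X, y)

# assert on the best-so-far results recorded in subsets_ instead of best_idx_
assert len(efs1.subsets_) > 0
last_subset = efs1.subsets_[max(efs1.subsets_)]
assert last_subset["feature_idx"] is not None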

rasbt commented 1 year ago

Good points. I think the concern you have is that best_idx_ is not the "best" index (or None) at the point when the run was interrupted.

A counter-argument for post-processing, though, is that (particularly for the EFS) a run can take a very long time. Sometimes users want to interrupt it manually due to time constraints but still use the results up to that point.

If we don't do the post-processing for them, users would have to figure it out themselves. It's not a deal-breaker, but when it comes to convenience, I think in >80% of cases users would probably prefer it if an interrupted SFS/EFS could later be used the same way as a fully fitted one, without extra steps.

Or, maybe a middle ground would be this:

we do what you propose regarding

1) returning self
2) leaving self.fitted=False

but we add a clean-up method that users have to call (at their own responsibility) after interruption. What would be a good name for this?

a. cleanup
b. cleanup_from_interrupt
c. postprocess_from_interrupt
d. finalize_fit

I would say b) and c) are the clearest, but if we go with a) or d), that's also something we can use internally (just the last steps refactored into a method).

What do you think?

NimaSarajpoor commented 1 year ago

I think the concern you have is that best_idx_ is not the "best" index (or None) at the point when the run was interrupted.

Yes. That's correct.

A counter-argument for post-processing, though, is that (particularly for the EFS) a run can take a very long time. Sometimes users want to interrupt it manually due to time constraints but still use the results up to that point.

If we don't do the post-processing for them, users would have to figure it out themselves. It's not a deal-breaker, but when it comes to convenience, I think in >80% of cases users would probably prefer it if an interrupted SFS/EFS could later be used the same way as a fully fitted one, without extra steps.

Yes, I understand. This is to just give users something to work with.

Or, maybe a middle ground would be this:

we do what you propose regarding

  1. returning self
  2. leaving self.fitted=False

but we add a clean-up method that users have to call (at their own responsibility) after interruption. What would be a good name for this?

a. cleanup
b. cleanup_from_interrupt
c. postprocess_from_interrupt
d. finalize_fit

I would say b) and c) are the clearest, but if we go with a) or d), that's also something we can use internally (just the last steps refactored into a method).

What do you think?

I think this is a good approach, and I think d should work. We may do something like this:

def finalize_fit(self):
    # compute the best score and the corresponding feature subset
    # from self.subsets_
    ...

def fit(self, X, y):
    # ... selection loop ...
    if self.interrupted_:
        # KeyboardInterrupt: skip the post-processing
        return self

    # otherwise:
    self.finalize_fit()
    return self
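From the user's perspective, the flow could then look roughly like this (illustrative only; finalize_fit is the hypothetical method discussed above):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
sfs = SFS(KNeighborsClassifier(n_neighbors=4), k_features=3)

# Suppose the user hits Ctrl+C during a long run; under this proposal,
# fit() catches the KeyboardInterrupt and returns self with partial results.
sfs = sfs.fit(X, y)

# At their own responsibility, the user then finishes the post-processing:
sfs.finalize_fit()           # hypothetical method discussed above
print(sfs.k_feature_idx_)    # best-so-far feature subset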

And then, in the docstring, we can explain that when a KeyboardInterrupt occurs, users can still get the best score and the corresponding subset of features by calling .finalize_fit at their own responsibility.

Sounds good? (Is this what you had in mind?)

rasbt commented 1 year ago

Yes, I think this is perfect. This way, people who, let's say, accidentally interrupt it are not misled, because it takes an extra deliberate step. But the step is not that inconvenient, so if you know what you are doing, it's a good, easy workaround!

NimaSarajpoor commented 1 year ago

@rasbt There is still 1 test function that fails. Something weird is going on here...

According to the test function tests/test_exhaustive_feature_selector.py::test_knn_wo_cv, we can see:

expect = {
        0: {
            "feature_idx": (0, 1),
            "feature_names": ("0", "1"),
            "avg_score": 0.82666666666666666,
            "cv_scores": np.array([0.82666667]),
        },
       ....

And this test function is passing. Now, if I run the same code in Jupyter, I get a different result(!):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
efs1 = EFS(
    knn,
    min_features=2,
    max_features=3,
    scoring="accuracy",
    cv=0,
    print_progress=False,
)
efs1 = efs1.fit(X, y)
efs1.subsets_

And I now see that expect[0]['avg_score'] is 0.8333333333333334. How is that possible? Note that this test function passes when I run pytest, but it should fail, right? (Or maybe I am missing something.)


Now, please take a look at tests/test_sequential_feature_selector_feature_groups.py::test_knn_wo_cv_with_fixed_features_and_feature_groups_case1 (this is the only failure we are getting in unit testing). The expected score for features (0, 1) is 0.8266..., but it is different from the score we are getting, which is 0.83.

Any thoughts?

NimaSarajpoor commented 1 year ago

@rasbt ~~Weird! I am getting one failure in unit testing when I run it in my local repo! (I explained it in my previous comment)~~

NimaSarajpoor commented 1 year ago

@rasbt Update:


I think there are three things we may want to do as the final steps (or leave them to someone else):

rasbt commented 1 year ago

Thanks for the updates!

Add test for dataframe/string in sequential feature selection with feature_groups

I can take care of this above

Move some of the checks to ...

Hm, yeah, that's not an easy one. In general, I think scikit-learn does all checks in fit(). I forget the details, but there was a rationale behind it. However, I agree with you that some things are better checked in __init__, like when two functionalities interact with or depend on each other, as fixed_features and feature_groups do. There could be an extra check in fit() to verify that the data indeed contains the specified features. But on the other hand, I feel like this can also be left to the user's responsibility, and we don't have to check every detail :P
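For what it's worth, a cross-check for the two interacting parameters could be as simple as something like this (a sketch under the assumed rule that a partially fixed group is invalid; the helper name is made up for illustration):

def check_fixed_features_in_groups(fixed_features, feature_groups):
    """Sketch: if any member of a group is fixed, the whole group should be
    fixed, because groups are selected or dropped as a unit."""
    fixed = set(fixed_features)
    for group in feature_groups:
        members = set(group)
        if members & fixed and not members <= fixed:
            raise ValueError(
                f"Group {group} is only partially covered by fixed_features; "
                "whole groups must be fixed when feature_groups is used."
            )

# Example: group [2, 3] is only partially fixed -> raises ValueError
# check_fixed_features_in_groups(fixed_features=[2], feature_groups=[[2, 3], [0], [1]])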

rasbt commented 1 year ago

Just added the unit tests. I think there's one unexpected behavior (second from the bottom) that doesn't look right (haha, or am I overlooking something?)

def test_check_pandas_dataframe_with_feature_groups_and_fixed_features():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)

    df = pd.DataFrame(
        X, columns=["sepal length", "sepal width", "petal length", "petal width"]
    )
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[
            ["petal length", "petal width"],
            ["sepal length"],
            ["sepal width"],
        ],
        fixed_features=[ "sepal length", "petal length", "petal width"],
        cv=0,
        verbose=0,
        n_jobs=1,
    )

    sfs1 = sfs1.fit(df, y)
    assert sfs1.k_feature_names_ == (
        'sepal length', 'petal length', 'petal width',
    ), sfs1.k_feature_names_

This returns ('sepal width', 'petal length', 'petal width')

But I think that's not possible. I.e., it can't return 'sepal width' because it's not in fixed_features. For the integer index version, this seems to work correctly:

def test_check_feature_groups():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[[2, 3], [0], [1]],
        fixed_features=[0, 2, 3],
        cv=0,
        verbose=0,
        n_jobs=1,
    )

    sfs1 = sfs1.fit(X, y)
    assert sfs1.k_feature_idx_ == (0, 2, 3), sfs1.k_feature_idx_

Or am I confused now? 😅

NimaSarajpoor commented 1 year ago

@rasbt Thanks for the help. Yeah... it seems something is off... I will investigate it. (It is good that you caught it! Thanks!!)

NimaSarajpoor commented 1 year ago

@rasbt It seems the error you mentioned is now resolved. Btw, I don't think I changed anything related to that test function. I am now going to resolve the one that still fails.

NimaSarajpoor commented 1 year ago

@rasbt Updates


I think things should be good now :)

rasbt commented 1 year ago

Thanks so much, this looks reasonable to me! I will see if there's anything to add on the documentation front but other than that, I think this is good to merge. I am also planning to make a release this weekend so that people can actually start using the awesome features you've added :)

NimaSarajpoor commented 1 year ago

Cool :) Thanks for your help and support throughout this journey. It has been fun :)