rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/
Other
4.85k stars 857 forks source link

WIP: having exhaustive feature selector take a custom iterator as input #834

Open jonathan-taylor opened 3 years ago

jonathan-taylor commented 3 years ago

Code of Conduct

A working candidate for https://github.com/rasbt/mlxtend/issues/833

Description

Related issues or pull requests

Pull Request Checklist

pep8speaks commented 3 years ago

Hello @jonathan-taylor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 289:80: E501 line too long (97 > 79 characters)

Line 119:1: W293 blank line contains whitespace Line 132:5: E303 too many blank lines (2)

Line 29:1: E302 expected 2 blank lines, found 1 Line 38:37: W291 trailing whitespace Line 39:46: W291 trailing whitespace Line 44:70: W291 trailing whitespace Line 216:1: W293 blank line contains whitespace Line 260:80: E501 line too long (80 > 79 characters) Line 261:80: E501 line too long (82 > 79 characters) Line 263:80: E501 line too long (85 > 79 characters) Line 264:80: E501 line too long (86 > 79 characters) Line 414:80: E501 line too long (86 > 79 characters) Line 424:43: W291 trailing whitespace Line 446:44: W291 trailing whitespace Line 501:1: W293 blank line contains whitespace Line 522:1: W391 blank line at end of file

Line 6:2: W291 trailing whitespace Line 16:22: E128 continuation line under-indented for visual indent Line 17:22: E128 continuation line under-indented for visual indent Line 18:22: E128 continuation line under-indented for visual indent Line 20:1: E302 expected 2 blank lines, found 1 Line 32:40: W291 trailing whitespace Line 45:1: W293 blank line contains whitespace Line 68:80: E501 line too long (80 > 79 characters) Line 69:80: E501 line too long (84 > 79 characters) Line 75:78: W291 trailing whitespace Line 84:80: E501 line too long (81 > 79 characters) Line 131:13: E741 ambiguous variable name 'l' Line 151:80: E501 line too long (98 > 79 characters) Line 155:80: E501 line too long (89 > 79 characters) Line 158:1: W293 blank line contains whitespace Line 172:57: W291 trailing whitespace Line 192:1: W293 blank line contains whitespace Line 199:44: W291 trailing whitespace Line 205:1: W293 blank line contains whitespace Line 240:80: E501 line too long (80 > 79 characters) Line 241:80: E501 line too long (84 > 79 characters) Line 247:78: W291 trailing whitespace Line 256:80: E501 line too long (81 > 79 characters) Line 270:1: W293 blank line contains whitespace Line 277:71: W291 trailing whitespace Line 281:72: W291 trailing whitespace Line 285:1: W293 blank line contains whitespace Line 295:57: W291 trailing whitespace Line 308:43: E261 at least two spaces before inline comment Line 309:80: E501 line too long (99 > 79 characters) Line 310:65: E128 continuation line under-indented for visual indent Line 310:80: E501 line too long (112 > 79 characters) Line 314:43: E261 at least two spaces before inline comment Line 315:80: E501 line too long (96 > 79 characters) Line 316:68: E128 continuation line under-indented for visual indent Line 316:80: E501 line too long (115 > 79 characters) Line 326:1: W293 blank line contains whitespace Line 355:80: E501 line too long (80 > 79 characters) Line 356:80: E501 line too long (84 > 79 characters) Line 364:78: W291 trailing whitespace Line 373:80: E501 line too long (81 > 79 characters) Line 379:54: W291 trailing whitespace Line 407:80: E501 line too long (93 > 79 characters) Line 449:80: E501 line too long (80 > 79 characters) Line 450:80: E501 line too long (84 > 79 characters) Line 458:78: W291 trailing whitespace Line 467:80: E501 line too long (81 > 79 characters) Line 473:54: W291 trailing whitespace Line 501:80: E501 line too long (93 > 79 characters) Line 513:1: W293 blank line contains whitespace Line 516:1: E303 too many blank lines (3) Line 537:80: E501 line too long (80 > 79 characters) Line 543:74: W291 trailing whitespace Line 558:50: W291 trailing whitespace Line 568:47: W291 trailing whitespace Line 588:1: W293 blank line contains whitespace Line 612:1: E302 expected 2 blank lines, found 1 Line 617:40: W291 trailing whitespace Line 635:1: E302 expected 2 blank lines, found 1 Line 641:40: W291 trailing whitespace Line 665:80: E501 line too long (94 > 79 characters) Line 668:1: W293 blank line contains whitespace Line 669:1: E302 expected 2 blank lines, found 1 Line 682:33: E261 at least two spaces before inline comment Line 689:1: W293 blank line contains whitespace Line 690:1: E302 expected 2 blank lines, found 1 Line 703:33: E261 at least two spaces before inline comment Line 740:39: W291 trailing whitespace Line 750:1: W293 blank line contains whitespace Line 761:1: W293 blank line contains whitespace Line 764:33: E261 at least two spaces before inline comment Line 764:34: E262 inline comment should start with '# ' Line 770:39: E261 at least two spaces before inline comment

Line 16:1: W293 blank line contains whitespace Line 17:1: E302 expected 2 blank lines, found 1 Line 25:44: E231 missing whitespace after ',' Line 56:34: E231 missing whitespace after ',' Line 56:36: E231 missing whitespace after ',' Line 66:1: E302 expected 2 blank lines, found 1 Line 70:8: E231 missing whitespace after ',' Line 76:44: E231 missing whitespace after ',' Line 89:8: E231 missing whitespace after ',' Line 95:53: E231 missing whitespace after ',' Line 96:55: E231 missing whitespace after ',' Line 103:1: E302 expected 2 blank lines, found 1 Line 110:1: W293 blank line contains whitespace Line 112:8: E231 missing whitespace after ',' Line 118:53: E231 missing whitespace after ',' Line 119:55: E231 missing whitespace after ',' Line 138:1: W293 blank line contains whitespace Line 139:1: E302 expected 2 blank lines, found 1 Line 143:8: E231 missing whitespace after ',' Line 152:57: E231 missing whitespace after ',' Line 153:59: E231 missing whitespace after ',' Line 154:80: E501 line too long (81 > 79 characters) Line 161:1: E302 expected 2 blank lines, found 1 Line 175:59: E231 missing whitespace after ',' Line 176:61: E231 missing whitespace after ',' Line 177:1: W293 blank line contains whitespace Line 185:1: E302 expected 2 blank lines, found 1 Line 193:69: W291 trailing whitespace Line 194:1: W293 blank line contains whitespace Line 195:1: E302 expected 2 blank lines, found 1 Line 200:80: E501 line too long (112 > 79 characters) Line 208:76: W291 trailing whitespace Line 209:44: W291 trailing whitespace Line 217:1: W293 blank line contains whitespace Line 241:15: E231 missing whitespace after ',' Line 243:1: W293 blank line contains whitespace Line 247:1: E302 expected 2 blank lines, found 1 Line 253:80: E501 line too long (109 > 79 characters) Line 261:75: W291 trailing whitespace Line 262:47: W291 trailing whitespace Line 271:1: W293 blank line contains whitespace Line 294:15: E231 missing whitespace after ',' Line 296:1: W293 blank line contains whitespace Line 300:1: E302 expected 2 blank lines, found 1 Line 306:80: E501 line too long (109 > 79 characters) Line 314:75: W291 trailing whitespace Line 315:47: W291 trailing whitespace Line 324:1: W293 blank line contains whitespace Line 347:15: E231 missing whitespace after ',' Line 349:1: W293 blank line contains whitespace Line 353:1: E302 expected 2 blank lines, found 1 Line 358:80: E501 line too long (121 > 79 characters) Line 366:46: W291 trailing whitespace Line 375:1: W293 blank line contains whitespace Line 411:15: E231 missing whitespace after ',' Line 413:1: W293 blank line contains whitespace Line 417:1: W293 blank line contains whitespace Line 417:1: W391 blank line at end of file

Comment last updated at 2021-11-29 18:26:37 UTC
jonathan-taylor commented 2 years ago

Obviously some linting needed, but have generalized the sequential / exhaustive feature selector to allow for categorical and custom search strategies. The previous searches default to special cases now. Post-processing of search results still need implementing.

jonathan-taylor commented 2 years ago

The type checking is more just a declaration of what a strategy is at this point.

Will work on PEP8 issues and also will try to have EFS and SFS use the generic one. I think EFS is straightforward, but SFS I have to understand the "floating" logic a little better -- I gather this is just "search both forward and backward" but somehow in self.subsets_​ we only keep a given model of a fixed size -- is this meant to be the best model of a fixed size? Or maybe I still have the logic wrong.


From: Sebastian Raschka @.> Sent: Tuesday, October 5, 2021 6:20 PM To: rasbt/mlxtend @.> Cc: Jonathan Taylor @.>; Mention @.> Subject: Re: [rasbt/mlxtend] WIP: having exhaustive feature selector take a custom iterator as input (#834)

@rasbt commented on this pull request.


In mlxtend/feature_selection/columns.pyhttps://github.com/rasbt/mlxtend/pull/834#discussion_r722814149:

@@ -0,0 +1,271 @@ + +from typing import NamedTuple, Any +from copy import copy + +import numpy as np +from sklearn.base import clone +from sklearn.preprocessing import (OneHotEncoder,

  • OrdinalEncoder) +from sklearn.utils.validation import check_is_fitted +from sklearn.exceptions import NotFittedError
  • +class Column(NamedTuple):

There is currently no type checking implemented in mlxtend, but maybe one day, so why not being proactive (or in other words, thanks for adhering to potential future best practices :)).


In mlxtend/feature_selection/tests/test_generic_selector.pyhttps://github.com/rasbt/mlxtend/pull/834#discussion_r722814708:

+ +have_pandas = True +try:

  • import pandas as pd +except ImportError:
  • have_pandas = False
  • +def test_exhaustive_selector():

  • n, p = 100, 5
  • X = np.random.standard_normal((n, p))
  • Y = np.random.standard_normal(n)
  • strategy = min_max(X,
  • max_features=4,
  • fixed_features=[2,3])

There are some pep8 issues, but we can fix them later, no worries.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/rasbt/mlxtend/pull/834#pullrequestreview-772111625, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AACTM23LRNTCT6Z7H3MLC6LUFOP5XANCNFSM47DDCTAA. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

rasbt commented 2 years ago

Thanks a lot for the PR, this is very exciting!

Big picture-wise, there are a few thoughts.

1) What do we do with the existing ExhaustiveFeatureSelector and SequentialFeatureSelector? We could deprecate them, that is, remove them from the documentation but leave them in the code for a few versions / years.

2) If we do deprecate the existing SFS, two missing features would be floating-forward and floating-backward. I think right now, via

    for direction in ['forward', 'backward', 'both']:
         strategy = step(X,
                         direction=direction,
                         max_features=p,
                         fixed_features=[2,3],
                         categorical_features=categorical_features)

it only supports the standard forward and backward. I assume that 'both' means it runs forward first and finds the best set. Then it runs backward (independently) to find the best set. Then, the best set is determined by comparing the results from forward and backward? This is actually a neat addition.

3) If we deprecate and add the floating variants, I think the only thing that we need to ensure is that it still remains compatible with scikit-learn pipelines and maybe GridSearchCV.

Amazing work, though. What you put together here is really exciting and impressive!

rasbt commented 2 years ago

I think EFS is straightforward, but SFS I have to understand the "floating" logic a little better -- I gather this is just "search both forward and backward" but somehow in self.subsets_​ we only keep a given model of a fixed size -- is this meant to be the best model of a fixed size? Or maybe I still have the logic wrong.

Oh, I should have described this a bit better in the docs. In the floating forward variant, it allows you to delete a previously selected feature if it improves performance. In the floating backward variant, it allows you to add back a previously deleted feature.

Maybe it is easier to explain with a toy example.

Say we have a feature set [0, 1, 3, 4, 5] and we do sequential "floating" forward selection.

Round 1

Starting with the empty set [], we check all features and find that feature 1 is best:

Round 2

Starting with the set [1], we check all features and find that feature 4 is best:

Round 3

Round 4

Note that we would not have selected [1, 2, 3] with regular forward selection. With regular forward selection, we would have had

[ ] -> [1] -> [1, 4] -> [1, 2, 4]

Or in other words, with regular forward selection, any feature set of length 3 would have included the value 4. However, in combination with other features, 4 may not be useful.

So, in practice, the floating forward variant allows us to test a few more combinations than the regular forward variant, without being as expensive as the exhaustive feature selection of course.

The floating backward variant works similar to the floating forward variant, except we are allowed to add a previously removed feature back.

jonathan-taylor commented 2 years ago

So it seems floating is as I thought. This is "both" - the stepwise will look to add or drop columns from current set. This is what "both" in R's step is.

My issue with the logic was just somehow how you tracked the progress within self.subsets_ as there is no longer necessarily a single model of size 3, say. The generic one just stores every state visited.

Get Outlook for Androidhttps://aka.ms/ghei36


From: Sebastian Raschka @.> Sent: Tuesday, October 5, 2021 6:49:56 PM To: rasbt/mlxtend @.> Cc: Jonathan Taylor @.>; Mention @.> Subject: Re: [rasbt/mlxtend] WIP: having exhaustive feature selector take a custom iterator as input (#834)

I think EFS is straightforward, but SFS I have to understand the "floating" logic a little better -- I gather this is just "search both forward and backward" but somehow in self.subsets_​ we only keep a given model of a fixed size -- is this meant to be the best model of a fixed size? Or maybe I still have the logic wrong.

Oh, I should have described this a bit better in the docs. In the floating forward variant, it allows you to delete a previously selected feature if it improves performance. In the floating backward variant, it allows you to add back a previously deleted feature.

Maybe it is easier to explain with a toy example.

Say we have a feature set [0, 1, 3, 4, 5] and we do sequential "floating" forward selection.

Round 1

Starting with the empty set [], we check all features and find that feature 1 is best:

Round 2

Starting with the set [1], we check all features and find that feature 4 is best:

Round 3

Round 4

Note that we would not have selected [1, 2, 3] with regular forward selection. With regular forward selection, we would have had

[ ] -> [1] -> [1, 4] -> [1, 2, 4]

Or in other words, with regular forward selection, any feature set of length 3 would have included the value 4. However, in combination with other features, 4 may not be useful.

So, in practice, the floating forward variant allows us to test a few more combinations than the regular forward variant, without being as expensive as the exhaustive feature selection of course.

The floating backward variant works similar to the floating forward variant, except we are allowed to add a previously removed feature back.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/rasbt/mlxtend/pull/834#issuecomment-935238866, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AACTM22FMPR6UAAKBHH64ZTUFOTMJANCNFSM47DDCTAA. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

rasbt commented 2 years ago

Yeah, I think with the new version it will be way easier to track.

Regarding "both" there is a difference between floating-forward and floating-backward though. They can both give you different end results. I think "both" is currently doing floating-forward?

rasbt commented 2 years ago

Thanks for the updates. Actually, some of the failing tests are related to sklearn 1.0. Due to bug fixes or changes, their have been very slight changes in some of the accuracy values, which will cause values when comparing the results on 3 positions after the decimal point. I am currently looking into this and will update the tests

jonathan-taylor commented 2 years ago

Got it. I am pretty comfortable with the state of the code, but I think more tests are needed on my end.

It is possible to rewrite the two current estimators in this form. Do you think that's a high priority?


From: Sebastian Raschka @.> Sent: Tuesday, November 2, 2021 11:42 AM To: rasbt/mlxtend @.> Cc: Jonathan Taylor @.>; Mention @.> Subject: Re: [rasbt/mlxtend] WIP: having exhaustive feature selector take a custom iterator as input (#834)

Thanks for the updates. Actually, some of the failing tests are related to sklearn 1.0. Due to bug fixes or changes, their have been very slight changes in some of the accuracy values, which will cause values when comparing the results on 3 positions after the decimal point. I am currently looking into this and will update the tests

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/rasbt/mlxtend/pull/834#issuecomment-958032536, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AACTM23DC62Z5USSFS7MGM3UKA5I7ANCNFSM47DDCTAA. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

rasbt commented 2 years ago

I am not sure how many people use the floating versions. One option could be to deprecate/remove them. We could maybe add a module mlxtend.legacy for that where we can keep the old sequential feature selection code for a while. However, I think think floating variants can be quite powerful. Just looking at the paper again (attached), here is a comparison of regular forward and backward selection compared to an exhaustive ("optimal") method: Screen Shot 2021-11-02 at 1 52 21 PM 1-s2.0-0167865594901279-main.pdf

Here it is compared to the floating version:

![Screen Shot 2021-11-02 at 1 52 06 PM](https://user- Screen Shot 2021-11-02 at 1 54 56 PM images.githubusercontent.com/5618407/139927438-77579a31-6f2a-48ff-8820-8deecd7eb76e.png)

So I would say having SFFS (floating forward) and SFBS (floating backward) is useful. But I can understand if it is too much work in terms of rewriting all the code so having only one of the two is probably sufficient (given that performance is very close anyway)

review-notebook-app[bot] commented 2 years ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

rasbt commented 2 years ago

Finally got around fixing the CI. Then, I was trying to rebase this PR but since it is not in a feature branch but in the master branch itself, I couldn't figure out how. I manually applied all the changes to this PR, and it appears I created quite a mess

rasbt commented 2 years ago

Actually not sure why the workflow CI doesn't run. It worked fine in another PR that I just closed a few minutes ago. Maybe it would be easiest to just close this PR, and make a new PR with the 4 files you had initially submitted, i.e,

if you could do this in a separate feature branch instead of master, I think this would make things easier. Let me know if I can help with anything