Integrate scikit-learn's `set_output` method into `TransactionEncoder`

it176131 commented 3 months ago

Code of Conduct

Description

This defines the :method:get_feature_names_out in :class:TransactionEncoder to expose the :method:set_output.

Related issues or pull requests

Fixes #1085

Pull Request Checklist

[x] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
- See 1c4d3284f29112457cbcd7548f3bb2483a6fa81f
[x] Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
- See 45cb6cd5d86fcd037107e318225d722bd98d5668, b21bb21e8461225f455e8ea19c5f9cb948e16f29, and 1a02dd5aa8c935f65b3951024c6e75cf86fa081b
[x] Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
- See 35dbeef60de441609d3c0d1d21d6c928e7c16daf
[x] Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
- 20 preexisting unit tests did not pass before work was started. I added two new unit tests to cover the work requested. Both new tests pass and the new work does not affect the prior tests.
[x] Checked for style issues by running flake8 ./mlxtend
- Prior work outside of this PR does not pass flake8 v7.0.0 guidelines. See mlxtend/feature_selection/column_selector.py:81

review-notebook-app[bot] commented 3 months ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

rasbt commented 3 months ago

Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

it176131 commented 3 months ago

Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

Sure thing! I wasn't sure what to put regarding newer version release dates so I bumped the patch number by one and set the release date to "TBD". Lmk if I should change anything.

it176131 commented 3 months ago

Hey @rasbt it looks like the pipeline checks keep failing for files unrelated to this PR. Should I log a separate issue and open a new PR to fix those too?

rasbt commented 3 months ago

I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.

From a quick look, there seems to be a more major problem though:

      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.

And thanks again for your efforts ... I wish I could be more responsive but it's a busy week

it176131 commented 3 months ago

I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.

From a quick look, there seems to be a more major problem though:
      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]
Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.

And thanks again for your efforts ... I wish I could be more responsive but it's a busy week

I'll do some digging. If I can replicate the issue and come up with a solution I'll submit it in a separate PR.

rasbt commented 3 months ago

No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

rasbt commented 3 months ago

Hm, I think the new failures could be due to the sklearn version bump

it176131 commented 3 months ago

No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

I noticed that scikit-learn 1.1.3 is being installed in the github workflow. Can we bump it to 1.2.2 as that is required for set_output to work?

rasbt commented 3 months ago

Yes, please feel free to bump it up. We probably need to fix some other places that have not been adjusted for the most recent version though

it176131 commented 3 months ago

Running test_inverse_transform locally, it appears the error is raised when np.array(data_sorted) is called. This is because the data_sorted is list of lists where the nested lists may have varying lengths (same for oht.inverse_transform(expect)). I see two potential ways around this:

Use a simpler assertion like assert data_sorted == oht.inverse_transform(expect)
Add dtype="object" to the np.array constructors

Both result in the test passing. I'm in favor of the first option as it doesn't require any changes to the data_sorted and the expected output type is a list rather than an np.ndarray.

it176131 commented 3 months ago

Turns out scikit-learn version 1.2.2 had a bug in the set_output API that failed when the input was not a pandas.DataFrame. See this PR for details. The fix came in scikit-learn version 1.3.1. Bumping the scikit-learn version to it fixes the issue. Commit(s) to follow.

it176131 commented 3 months ago

Of course bumping the version results in more failed tests 🤦

I'm going to make a separate PR to handle the scikit-learn version bump and the failed unit tests. THEN maybe this will work. Didn't realize it was going to take so much work 😅

rasbt commented 3 months ago

Arg sorry. Yeah, a sklearn bump was overdue but I recently didn't have the time to look into it since I wasn't using the affected features. If this is too much work, don't worry about it, I can understand if you want to drop this. I could revisit the version bump in the upcoming weeks then, address this, and then merge your PR once it's addressed.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 78.32%. Comparing base (e82c9c5) to head (f018a8d).

:exclamation: Current head f018a8d differs from pull request most recent head 44961b5. Consider uploading reports for the commit 44961b5 to get more accurate results

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #1087 +/- ## ========================================== + Coverage 78.29% 78.32% +0.03% ========================================== Files 196 196 Lines 11140 11157 +17 Branches 1404 1404 ========================================== + Hits 8722 8739 +17 Misses 2200 2200 Partials 218 218 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

it176131 commented 3 months ago

@rasbt looks like all checks passed. Ready to merge 😎

rasbt / mlxtend