rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/
Other
4.82k stars 853 forks source link

Integrate scikit-learn's `set_output` method into `TransactionEncoder` #1087

Closed it176131 closed 3 months ago

it176131 commented 3 months ago

Code of Conduct

Description

This defines the :method:get_feature_names_out in :class:TransactionEncoder to expose the :method:set_output.

Related issues or pull requests

Fixes #1085

Pull Request Checklist

review-notebook-app[bot] commented 3 months ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

rasbt commented 3 months ago

Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

it176131 commented 3 months ago

Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

Sure thing! I wasn't sure what to put regarding newer version release dates so I bumped the patch number by one and set the release date to "TBD". Lmk if I should change anything.

it176131 commented 3 months ago

Hey @rasbt it looks like the pipeline checks keep failing for files unrelated to this PR. Should I log a separate issue and open a new PR to fix those too?

rasbt commented 3 months ago

I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.

From a quick look, there seems to be a more major problem though:

      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.

Screenshot 2024-03-28 at 8 35 52 AM

And thanks again for your efforts ... I wish I could be more responsive but it's a busy week

it176131 commented 3 months ago

I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.

From a quick look, there seems to be a more major problem though:

      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.

Screenshot 2024-03-28 at 8 35 52 AM

And thanks again for your efforts ... I wish I could be more responsive but it's a busy week

I'll do some digging. If I can replicate the issue and come up with a solution I'll submit it in a separate PR.

rasbt commented 3 months ago

No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

rasbt commented 3 months ago

Hm, I think the new failures could be due to the sklearn version bump

it176131 commented 3 months ago

No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

I noticed that scikit-learn 1.1.3 is being installed in the github workflow. Can we bump it to 1.2.2 as that is required for set_output to work?

rasbt commented 3 months ago

Yes, please feel free to bump it up. We probably need to fix some other places that have not been adjusted for the most recent version though

it176131 commented 3 months ago

Running test_inverse_transform locally, it appears the error is raised when np.array(data_sorted) is called. This is because the data_sorted is list of lists where the nested lists may have varying lengths (same for oht.inverse_transform(expect)). I see two potential ways around this:

  1. Use a simpler assertion like assert data_sorted == oht.inverse_transform(expect)
  2. Add dtype="object" to the np.array constructors

Both result in the test passing. I'm in favor of the first option as it doesn't require any changes to the data_sorted and the expected output type is a list rather than an np.ndarray.

it176131 commented 3 months ago

Turns out scikit-learn version 1.2.2 had a bug in the set_output API that failed when the input was not a pandas.DataFrame. See this PR for details. The fix came in scikit-learn version 1.3.1. Bumping the scikit-learn version to it fixes the issue. Commit(s) to follow.

it176131 commented 3 months ago

Of course bumping the version results in more failed tests 🤦

I'm going to make a separate PR to handle the scikit-learn version bump and the failed unit tests. THEN maybe this will work. Didn't realize it was going to take so much work 😅

rasbt commented 3 months ago

Arg sorry. Yeah, a sklearn bump was overdue but I recently didn't have the time to look into it since I wasn't using the affected features. If this is too much work, don't worry about it, I can understand if you want to drop this. I could revisit the version bump in the upcoming weeks then, address this, and then merge your PR once it's addressed.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 78.32%. Comparing base (e82c9c5) to head (f018a8d).

:exclamation: Current head f018a8d differs from pull request most recent head 44961b5. Consider uploading reports for the commit 44961b5 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1087 +/- ## ========================================== + Coverage 78.29% 78.32% +0.03% ========================================== Files 196 196 Lines 11140 11157 +17 Branches 1404 1404 ========================================== + Hits 8722 8739 +17 Misses 2200 2200 Partials 218 218 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

it176131 commented 3 months ago

@rasbt looks like all checks passed. Ready to merge 😎