tkellogg commented 1 year ago

Description

While the current functionality supports sparse matrices on an API level, it calls pd.DataFrame.values in order to get the number of rows in the DataFrame. Unfortunately, .values forces the entire DataFrame to be converted into a non-sparse 2D numpy array. As a result, switching to sparse matrices to avoid an out-of-memory (OOM) error doesn't actually prevent it.

This fix is semantically identical, in Pandas terms, except that it doesn't materialize a dense array just to find its length.
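
For illustration, here is a minimal sketch of the difference. The column names and data are made up, and `len(df.index)` is only assumed as the replacement expression (the exact line changed in fpmax.py isn't reproduced here); `df.shape[0]` would behave the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot transaction table (names and data are illustrative).
dense = pd.DataFrame(
    {
        "milk": [True, False, False, True],
        "eggs": [False, False, True, False],
    }
)

# Store every column with a sparse dtype, as a user avoiding an OOM might.
sparse_df = dense.astype(pd.SparseDtype(bool, fill_value=False))

# Old pattern: .values first builds a dense 2D numpy array of the whole frame.
as_dense = sparse_df.values
rows_via_values = len(as_dense)

# Assumed replacement: ask the index for its length; no dense copy is built.
rows_via_index = len(sparse_df.index)

# Equivalent alternative that also avoids the dense copy.
rows_via_shape = sparse_df.shape[0]
```

All three expressions give the same row count; only the first one pays the memory cost of a dense array with one cell per row and column.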

Related issues or pull requests

N/A

Pull Request Checklist

[x] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
[ ] Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
[ ] Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
[x] Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
- NOTE: 5 tests don't pass, but that's consistent with the state prior to this change.
[x] Checked for style issues by running flake8 ./mlxtend

rasbt commented 1 year ago

> Unfortunately, .values forces the entire DataFrame to be converted into a non-sparse 2D numpy array

Wow, good catch! Thanks for the PR

codecov[bot] commented 1 year ago

Codecov Report

Base: 77.45% // Head: 77.46% // Increases project coverage by +0.01% :tada:

Coverage data is based on head (29c97c0) compared to base (f248eb6). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1000      +/-   ##
==========================================
+ Coverage   77.45%   77.46%   +0.01%
==========================================
  Files         198      198
  Lines       11171    11171
  Branches     1406     1406
==========================================
+ Hits         8652     8654       +2
+ Misses       2305     2304       -1
+ Partials      214      213       -1
```

| [Impacted Files](https://codecov.io/gh/rasbt/mlxtend/pull/1000?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Sebastian+Raschka) | Coverage Δ | |
|---|---|---|
| [mlxtend/frequent\_patterns/fpmax.py](https://codecov.io/gh/rasbt/mlxtend/pull/1000/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Sebastian+Raschka#diff-bWx4dGVuZC9mcmVxdWVudF9wYXR0ZXJucy9mcG1heC5weQ==) | `91.20% <100.00%> (ø)` | |
| [mlxtend/evaluate/counterfactual.py](https://codecov.io/gh/rasbt/mlxtend/pull/1000/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Sebastian+Raschka#diff-bWx4dGVuZC9ldmFsdWF0ZS9jb3VudGVyZmFjdHVhbC5weQ==) | `100.00% <0.00%> (+6.89%)` | :arrow_up: |

rasbt / mlxtend

Fix fpmax for sparse matrices #1000