Sparse Implementation of `build_coleman_forest` plus a fix for bottleneck import

ryanhausen commented 3 weeks ago

Reference Issues/PRs

Fixes #310

What does this implement/fix? Explain your changes.

This adds a sparse implementation of build_coleman_forest to treeple.stats. The sparse implementation relies on scipy.stats.sparse and so only works in the case of binary classification and regression. In my tests using cProfile/memray, the sparse implementation is a little over 50% faster and uses about %7 less memory, see pdf for more information. I am also attaching the profiling data if you'd like to take a look.

This PR also includres a change to the bottleneck implementation to for the dense implementation of build_coleman_forest. Specifically there are two important changes:

The check for the existence of the bottleneck package is now done using importlib.util.find_spec("bottleneck"). The old check using sys seemed to work fine and passed tests, but actually didn't seem work unless bottleneck had been imported before, which when the tests are run is true, but may not be the case in other situations
the aliased and lambda functions nanmean_f and anynan_f are now defined using def so that the warning message can moved into their respective calls rather than at import.

Any other comments?

The scipy.stats sparse implementation should be swapped out for pydata.sparse when that implementation is performant enough, as it is more general.

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 89.79592% with 10 lines in your changes missing coverage. Please review.

Project coverage is 80.27%. Comparing base (1d970b0) to head (d733508). Report is 3 commits behind head on main.

Files	Patch %	Lines
treeple/stats/forest.py	81.39%	5 Missing and 3 partials :warning:
treeple/stats/utils.py	96.36%	0 Missing and 2 partials :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #317 +/- ## ========================================== + Coverage 80.00% 80.27% +0.26% ========================================== Files 24 24 Lines 2221 2296 +75 Branches 411 422 +11 ========================================== + Hits 1777 1843 +66 - Misses 312 317 +5 - Partials 132 136 +4 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ryanhausen commented 3 weeks ago

@adam2392 I have updated the PR, let me know if I missed anything or if you want to see some more changes.

ryanhausen commented 2 weeks ago

@adam2392 I added some documentation to address your comments. Let me know what you think.

adam2392 commented 2 weeks ago

I will let @sampan501 @SUKI-O @YuxinB @PSSF23 take a look as well.

neurodata / treeple