neurodata / treeple

Scikit-learn compatible decision trees beyond those offered in scikit-learn
https://treeple.ai

added bottleneck for nan calculations #306

Closed ryanhausen closed 1 month ago

ryanhausen commented 1 month ago

Reference Issues/PRs

None

What does this implement/fix? Explain

Changes NaN-related calculations in treeple.stats.utils to use bottleneck rather than numpy. In my testing, this offers a significant performance speedup; see attached doc: profiling.pdf

I have also attached the cProfile outputs for the benchmarking in the above PDF: profiles.zip
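To illustrate the kind of change being profiled here, below is a sketch of using bottleneck's NaN-aware reductions as drop-in replacements for numpy's, with a fallback when the library is absent (this is an illustrative pattern, not treeple's actual code):

```python
import numpy as np

# bottleneck is treated as an optional speedup; fall back to numpy when absent
try:
    import bottleneck as bn
    nanmean, nansum = bn.nanmean, bn.nansum
except ImportError:
    nanmean, nansum = np.nanmean, np.nansum

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 100))
x[rng.random(x.shape) < 0.1] = np.nan  # sprinkle in some NaNs

# the nan-aware reductions agree with numpy's to floating-point tolerance
assert np.isclose(nanmean(x), np.nanmean(x))
assert np.allclose(nansum(x, axis=0), np.nansum(x, axis=0))
```

Because the bottleneck functions take the same arguments and return the same shapes as their numpy counterparts, swapping them in is mostly mechanical.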

Any other comments?

I am opening this as a draft because I am not sure whether you want to make bottleneck a required dependency or an optional one, and if optional, how you want to categorize it.

Looking forward to your feedback!

adam2392 commented 1 month ago

I would categorize bottleneck as an optional dependency within a new category called 'extra'.

Feel free to add the relevant entry here: https://github.com/neurodata/treeple/blob/main/pyproject.toml
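For reference, the entry might look something like the following sketch (the group name 'extra' is from the comment above; the version pin is a placeholder, not a decided constraint):

```toml
[project.optional-dependencies]
extra = ["bottleneck>=1.3"]
```

Users would then opt in with `pip install treeple[extra]`, while the default install stays unchanged.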

In addition, can you add a CHANGELOG entry?

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 78.66%. Comparing base (dd28c41) to head (7244b98). Report is 4 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #306      +/-   ##
==========================================
+ Coverage   78.53%   78.66%   +0.13%
==========================================
  Files          24       24
  Lines        2250     2264      +14
  Branches      413      417       +4
==========================================
+ Hits         1767     1781      +14
  Misses        352      352
  Partials      131      131
```


adam2392 commented 1 month ago

Do you know how other packages test optional dependencies? It would be great to add a unit-test that runs the stuff w/ and w/o bottleneck to assert the answer is the same.
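One common pattern for this is to run the same computation through both code paths and assert they agree, skipping the bottleneck path when the library isn't installed. A sketch (the helper name here is illustrative, not treeple's API):

```python
import importlib.util

import numpy as np

# detect the optional dependency without hard-failing when it is absent
HAS_BOTTLENECK = importlib.util.find_spec("bottleneck") is not None

def count_non_nan_rows(x, use_bottleneck=True):
    """Count rows containing no NaNs; illustrative stand-in for a treeple helper."""
    if use_bottleneck and HAS_BOTTLENECK:
        import bottleneck as bn
        row_has_nan = bn.anynan(x, axis=1)
    else:
        row_has_nan = np.isnan(x).any(axis=1)
    return int(np.sum(~row_has_nan))

# both code paths must agree on the same data
x = np.array([[1.0, np.nan], [1.0, 2.0], [np.nan, np.nan]])
assert count_non_nan_rows(x, use_bottleneck=False) == 1
assert count_non_nan_rows(x, use_bottleneck=True) == 1
```

In a pytest suite this would typically be parametrized over the `use_bottleneck` flag, so CI exercises both branches whenever bottleneck is installed.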

ryanhausen commented 1 month ago

EDIT: Sorry, I didn't notice your comment above. Yeah, I wanted to add a test too. Right now bottleneck only touches two things, _non_nan_samples and _parallel_build_null_forests. I added a test for _non_nan_samples, but _parallel_build_null_forests does quite a bit and doesn't seem to have its own test. It could be refactored into multiple more modular functions that could each be tested separately, but to follow the existing testing structure, I modified one of the other tests to toggle bottleneck off/on.

@adam2392 Would you mind taking a look again? I made a couple of changes to make it testable, added a test, and changed an existing test. I am failing codecov, but I'm not sure where; the link from codecov isn't showing me the file. I am sure it's something on my end.

adam2392 commented 1 month ago

It shows this: [screenshot: Screenshot 2024-08-01 at 9 28 44 AM]

ryanhausen commented 1 month ago

Interesting, ok thanks. I'll fix it.

ryanhausen commented 1 month ago

@adam2392 I fixed coverage locally. However, by design, bottleneck isn't included in the build because it's optional. I can add it to the build or test requirements files, which would force it into the CI process. Do you have a preference for which one?

adam2392 commented 1 month ago

Yes in the CI that tests coverage, we should add the extra group of installs.

ryanhausen commented 1 month ago

@adam2392 I got it added to the build step by overriding pip install in spin. It looks like the mac builds for Python 3.9/3.10 aren't happy about something, but the others are ok. I don't have a mac to test on. Have you seen this behavior before?

ryanhausen commented 1 month ago

build_and_test_slow uses spin for the install too, right? Would installing the package via pip, rather than using spin, not create other issues?

ryanhausen commented 1 month ago

Would you mind rerunning those actions? If it doesn't go away, then maybe there is an issue with bottleneck that we need to know about.

adam2392 commented 1 month ago

build_and_test_slow uses spin for the install too, right? Would installing the package via pip, rather than using spin, not create other issues?

No, for some unfortunate reasons we use pip install from a requirements file: https://github.com/neurodata/treeple/blob/1e62fe3a9a3323bc2bfe1f7abbffdf5c32256c50/.github/workflows/main.yml#L180-L186

So I would just add a pip install .[extra] step there, revert the changes in .spin/cmds.py, and see if that addresses the coverage issue.
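As a sketch, the suggested workflow step might look like this (a hypothetical fragment for .github/workflows/main.yml; the step name is illustrative):

```yaml
- name: Install optional dependencies
  run: pip install ".[extra]"
```

This keeps spin's install behavior untouched and only layers the optional group on top in CI.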

Note ./spin install will not install anything besides the treeple package, and its required dependencies, so all extra installations, such as for docs, testing, and style checks are separate.

adam2392 commented 1 month ago

Now that I think about it, you can also add bottleneck to the test group, since we always want it for unit-testing. If you go this route, you would also just add it to the test_requirements.txt file.

ryanhausen commented 1 month ago

@adam2392 Would you mind taking a look when you have the chance?

ryanhausen commented 1 month ago

@adam2392 Sorry, I missed your comment. I merged with the latest from main and updated the version in the docs. Looks like there was a weird error with one of the Mac builds; it doesn't seem related to the code, though.

adam2392 commented 1 month ago

Cool. Thanks for the PR @ryanhausen! Can you take a look at what the differential is when using jovo's suggestion?

I.e., pre-allocate fewer NaNs than needed.
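If I'm reading the suggestion right, it amounts to something like the following sketch: size the NaN buffer to the rows that will actually be filled, instead of allocating one for every row (the sizes and the fill value here are hypothetical; this is not treeple's actual allocation code):

```python
import numpy as np

n_rows, n_cols = 10_000, 50
n_valid = 1_000  # only this many rows end up being filled

# before: allocate a NaN buffer for every row, then fill a subset
full = np.full((n_rows, n_cols), np.nan)
full[:n_valid] = 0.0

# suggestion, as I read it: allocate only the rows that are needed
compact = np.full((n_valid, n_cols), np.nan)
compact[:] = 0.0

# the compact buffer carries the same filled data at a fraction of the allocation
assert compact.shape == (n_valid, n_cols)
assert np.array_equal(full[:n_valid], compact)
```

The win would be avoiding both the larger allocation and the cost of writing NaN into rows that are immediately overwritten or never used.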

ryanhausen commented 1 month ago

Yep!