tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Python LD Matrix API #2902

Closed lkirk closed 5 months ago

lkirk commented 5 months ago

Description

As described in #2829, I've implemented the python API for the ld_matrix code. It should cover everything we described for the sites mode (barring the one comment I recently left on that issue). The test coverage handles odd edge cases (in the form of the mutated topology examples), the happy path, as well as input validation.

Do I need to create documentation before this can go out or can I handle it in a subsequent PR?

PR Checklist:

codecov[bot] commented 5 months ago

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (1601a5e) 89.72% compared to head (d763de8) 89.73%.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902/graphs/tree.svg?width=650&height=150&src=pr&token=7RoAMA56V2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) ```diff @@ Coverage Diff @@ ## main #2902 +/- ## ========================================== + Coverage 89.72% 89.73% +0.01% ========================================== Files 30 30 Lines 30296 30360 +64 Branches 5896 5903 +7 ========================================== + Hits 27183 27245 +62 - Misses 1781 1782 +1 - Partials 1332 1333 +1 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.12% <ø> (-0.01%)` | :arrow_down: | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `67.85% <97.01%> (+0.12%)` | :arrow_up: | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `98.96% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c/tskit/trees.c](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-Yy90c2tpdC90cmVlcy5j) | `90.18% <ø> (-0.01%)` | :arrow_down: | | [python/\_tskitmodule.c](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-cHl0aG9uL190c2tpdG1vZHVsZS5j) | `88.72% <97.01%> (+0.07%)` | :arrow_up: | ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Last update [1601a5e...d763de8](https://app.codecov.io/gh/tskit-dev/tskit/pull/2902?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev).
lkirk commented 5 months ago

I'm a little bit confused by the code coverage annotations in the _tskitmodule.c file, I'm pretty sure that there are some false positives. I can do a more thorough check, I was wondering if it's a known issue.

I went ahead and made an initial cut of documentation so that this python code at least has some barebones documentation. I can add demos and examples in a few days or in a subsequent PR.

I've attached a pdf render of the docs I've created so far. The ld_matrix docstring makes it into the TreeSequence methods. I've also added a section inside of multi-site statistics.

lkirk commented 5 months ago

I'm not sure I agree with the plumbing setup though: I think we should have functions methods called D_matrix, D2_matrix`` etc rather than have the stat specified as a string (although that may be a useful sugar to add later).

I see how you've suggested how to structure things, I think I agree, but I'm wondering if we'd expose all of the matrix methods outside of _ll_tree_sequence (in trees.py) or if we would wrap them all in ld_matrix. My sense from the discussion on #2829 is that we'd have one unified LD interface, since as @apragsdale points out, there are many more statistics to implement.

For the _tskitmodule.c coverage, you need to add some basic tests to the test_lowlevel.py file which stresses the interface. The rest of the suite doesn't count towards coverage of the C code.

I see, I can put my input validation tests there to get the test coverage up.

For testing, let's stay miles away from test_tree_stats.py, it's horrifying. Please use test_divmat.py as an example for how we're hoping to structure the tests from now on.

I remember you saying this. I'm sorry that I ended up importing it, I got focused on test cases and the various edge cases and got lost in the weeds on that one. I think I should just delete what I have and put the python tests into another PR altogether.

This is quite a large PR again, it really would help to try and keep things to smaller so I can give feedback earier in the development cycle. Or, if the changes are necessarily large, opening a draft PR at an early stage to I can comment on earlier versions.

This is something I'm still trying to improve on. That's why I reached out on slack to gauge how to split things up. Can I propose that I remove all documentation and high-level python testing, focusing specifically on code that is present in _tskitmodule.c, trees.py, and the minimal things that will go into test_lowlevel.py and leave it at that?

In fact, I could just close this PR and open a smaller one if that'd be cleaner.

After we agree on this layer, I'll open up a PR for extensive python testing and documentation separately. Does that sound like more manageable chunks?

jeromekelleher commented 5 months ago

About the method calls - do you envision splitting out each summary function to its own method call, but keeping the ld_matrix(..., stat="r") high-level interface? There are relatively few stats implemented right now, maybe just half a dozen,

You're right @apragsdale - I think it would be nice to have a ld_matrix(..., stat="x") function. I'm just saying we should implement the dispatch layer in Python rather than C. A golden rule here is to not do anything complicated in C unless you have to.

jeromekelleher commented 5 months ago

My suggestion here to move things along is to implement the minimal changes for _tskitmodule.c with corresponding tests in test_lowlevel.py, which has a 1-1 mapping from C API functions to low-level Python-C functions. There's a bit of boilderplate involved, but in my experience it's very stable boilerplate - you do it once, make sure it's tested and then forget about it.

How does that sound?

lkirk commented 5 months ago

My suggestion here to move things along is to implement the minimal changes for _tskitmodule.c with corresponding tests in test_lowlevel.py, which has a 1-1 mapping from C API functions to low-level Python-C functions. There's a bit of boilderplate involved, but in my experience it's very stable boilerplate - you do it once, make sure it's tested and then forget about it.

How does that sound?

That sounds good. I'll rework this PR into what you've suggested.

lkirk commented 5 months ago

I've simplified things and added the per-stat matrix methods to the TreeSequence object

lkirk commented 5 months ago

Okay, I think that addresses your concerns. The one thing I haven't been able to do (mostly because I'm not sure how) is to leak check this code. Is that even possible? I don't see any CI tests for something like that.

jeromekelleher commented 5 months ago

The one thing I haven't been able to do (mostly because I'm not sure how) is to leak check this code. Is that even possible?

Yeah, this is a pain. Running in a loop and watching top is the best way I've found. Memray might do it now, though, I haven't tried it for this purpose.

lkirk commented 5 months ago

Cleaned up memory leaks, added tests for column sites. Good catch on that memory leak, I have tried memray in the past, but hadn't thought to run it multiple times to accumulate a noticeable amount of leaking memory. That seemed to do the trick.

Here's a plot of memory over time before I fixed the leak: image And after the leak was fixed: image

So, it seems like this isn't leaking memory anymore. Less satisfying than using valgrind, but I think I'm convinced.

I've preemptively squashed down the commits, but let me know if there's anything else to fix.

jeromekelleher commented 5 months ago

Great to see this in.

Next step is to get the top-level Python interface in. I suggest we get the bare-bones in, without documentation first, so that we can be reasonably relaxed about the parameters etc, and then spend a period evaluating it performance and usablity-wise. Once we're happy, then we do the final documentation.

If we document prematurely, it prevents us pushing out quick bugfix releases on tskit until we're 100% happy with the final API. Better to not document initially, and evaluate.

Hopefully you can cherry-pick out the commits from your earlier branch to do this?

Sound good?

jeromekelleher commented 5 months ago

Also great to know that memray works for this now, much better than staring at top! Maybe we should add a note about this to the Python C API dev docs?

lkirk commented 5 months ago

Also great to know that memray works for this now, much better than staring at top! Maybe we should add a note about this to the Python C API dev docs?

Sure, I'll draft up a PR for that.

lkirk commented 5 months ago

Great to see this in.

Next step is to get the top-level Python interface in. I suggest we get the bare-bones in, without documentation first, so that we can be reasonably relaxed about the parameters etc, and then spend a period evaluating it performance and usablity-wise. Once we're happy, then we do the final documentation.

If we document prematurely, it prevents us pushing out quick bugfix releases on tskit until we're 100% happy with the final API. Better to not document initially, and evaluate.

Hopefully you can cherry-pick out the commits from your earlier branch to do this?

Sound good?

Absolutely, I'll move the previous commits to a new branch and open a PR. I think something that was a bit confusing to me when making contributions is the PR checklist, it sort of implies that everything should come as a package. Docs, tests, along with the implementation. It seems, at least for things that are a bit more involved, we want more of a piecemeal approach. I suppose I really should have been reading this in more detail :/. I see what you mean about documentation blocking releases: once documentation goes up, it's public.

jeromekelleher commented 5 months ago

Yeah, I'm a bit dubious on the value of those checklists to be honest. There's only a very limited set of circumstances where they apply, and most of the time they are ignored or inappropriate.