Closed lkirk closed 5 months ago
Attention: 2 lines
in your changes are missing coverage. Please review.
Comparison is base (
1601a5e
) 89.72% compared to head (d763de8
) 89.73%.
I'm a little bit confused by the code coverage annotations in the _tskitmodule.c
file, I'm pretty sure that there are some false positives. I can do a more thorough check, I was wondering if it's a known issue.
I went ahead and made an initial cut of documentation so that this python code at least has some barebones documentation. I can add demos and examples in a few days or in a subsequent PR.
I've attached a pdf render of the docs I've created so far. The ld_matrix
docstring makes it into the TreeSequence methods. I've also added a section inside of multi-site statistics
.
I'm not sure I agree with the plumbing setup though: I think we should have functions methods called
D_matrix
, D2_matrix`` etc rather than have the stat specified as a string (although that may be a useful sugar to add later).
I see how you've suggested how to structure things, I think I agree, but I'm wondering if we'd expose all of the matrix methods outside of _ll_tree_sequence
(in trees.py
) or if we would wrap them all in ld_matrix
. My sense from the discussion on #2829 is that we'd have one unified LD interface, since as @apragsdale points out, there are many more statistics to implement.
For the _tskitmodule.c coverage, you need to add some basic tests to the
test_lowlevel.py
file which stresses the interface. The rest of the suite doesn't count towards coverage of the C code.
I see, I can put my input validation tests there to get the test coverage up.
For testing, let's stay miles away from
test_tree_stats.py
, it's horrifying. Please use test_divmat.py as an example for how we're hoping to structure the tests from now on.
I remember you saying this. I'm sorry that I ended up importing it, I got focused on test cases and the various edge cases and got lost in the weeds on that one. I think I should just delete what I have and put the python tests into another PR altogether.
This is quite a large PR again, it really would help to try and keep things to smaller so I can give feedback earier in the development cycle. Or, if the changes are necessarily large, opening a draft PR at an early stage to I can comment on earlier versions.
This is something I'm still trying to improve on. That's why I reached out on slack to gauge how to split things up. Can I propose that I remove all documentation and high-level python testing, focusing specifically on code that is present in _tskitmodule.c
, trees.py
, and the minimal things that will go into test_lowlevel.py
and leave it at that?
In fact, I could just close this PR and open a smaller one if that'd be cleaner.
After we agree on this layer, I'll open up a PR for extensive python testing and documentation separately. Does that sound like more manageable chunks?
About the method calls - do you envision splitting out each summary function to its own method call, but keeping the ld_matrix(..., stat="r") high-level interface? There are relatively few stats implemented right now, maybe just half a dozen,
You're right @apragsdale - I think it would be nice to have a ld_matrix(..., stat="x")
function. I'm just saying we should implement the dispatch layer in Python rather than C. A golden rule here is to not do anything complicated in C unless you have to.
My suggestion here to move things along is to implement the minimal changes for _tskitmodule.c
with corresponding tests in test_lowlevel.py
, which has a 1-1 mapping from C API functions to low-level Python-C functions. There's a bit of boilderplate involved, but in my experience it's very stable boilerplate - you do it once, make sure it's tested and then forget about it.
How does that sound?
My suggestion here to move things along is to implement the minimal changes for
_tskitmodule.c
with corresponding tests intest_lowlevel.py
, which has a 1-1 mapping from C API functions to low-level Python-C functions. There's a bit of boilderplate involved, but in my experience it's very stable boilerplate - you do it once, make sure it's tested and then forget about it.How does that sound?
That sounds good. I'll rework this PR into what you've suggested.
I've simplified things and added the per-stat matrix methods to the TreeSequence
object
Okay, I think that addresses your concerns. The one thing I haven't been able to do (mostly because I'm not sure how) is to leak check this code. Is that even possible? I don't see any CI tests for something like that.
The one thing I haven't been able to do (mostly because I'm not sure how) is to leak check this code. Is that even possible?
Yeah, this is a pain. Running in a loop and watching top
is the best way I've found. Memray might do it now, though, I haven't tried it for this purpose.
Cleaned up memory leaks, added tests for column sites. Good catch on that memory leak, I have tried memray in the past, but hadn't thought to run it multiple times to accumulate a noticeable amount of leaking memory. That seemed to do the trick.
Here's a plot of memory over time before I fixed the leak:
And after the leak was fixed:
So, it seems like this isn't leaking memory anymore. Less satisfying than using valgrind, but I think I'm convinced.
I've preemptively squashed down the commits, but let me know if there's anything else to fix.
Great to see this in.
Next step is to get the top-level Python interface in. I suggest we get the bare-bones in, without documentation first, so that we can be reasonably relaxed about the parameters etc, and then spend a period evaluating it performance and usablity-wise. Once we're happy, then we do the final documentation.
If we document prematurely, it prevents us pushing out quick bugfix releases on tskit until we're 100% happy with the final API. Better to not document initially, and evaluate.
Hopefully you can cherry-pick out the commits from your earlier branch to do this?
Sound good?
Also great to know that memray works for this now, much better than staring at top
! Maybe we should add a note about this to the Python C API dev docs?
Also great to know that memray works for this now, much better than staring at
top
! Maybe we should add a note about this to the Python C API dev docs?
Sure, I'll draft up a PR for that.
Great to see this in.
Next step is to get the top-level Python interface in. I suggest we get the bare-bones in, without documentation first, so that we can be reasonably relaxed about the parameters etc, and then spend a period evaluating it performance and usablity-wise. Once we're happy, then we do the final documentation.
If we document prematurely, it prevents us pushing out quick bugfix releases on tskit until we're 100% happy with the final API. Better to not document initially, and evaluate.
Hopefully you can cherry-pick out the commits from your earlier branch to do this?
Sound good?
Absolutely, I'll move the previous commits to a new branch and open a PR. I think something that was a bit confusing to me when making contributions is the PR checklist, it sort of implies that everything should come as a package. Docs, tests, along with the implementation. It seems, at least for things that are a bit more involved, we want more of a piecemeal approach. I suppose I really should have been reading this in more detail :/. I see what you mean about documentation blocking releases: once documentation goes up, it's public.
Yeah, I'm a bit dubious on the value of those checklists to be honest. There's only a very limited set of circumstances where they apply, and most of the time they are ignored or inappropriate.
Description
As described in #2829, I've implemented the python API for the ld_matrix code. It should cover everything we described for the sites mode (barring the one comment I recently left on that issue). The test coverage handles odd edge cases (in the form of the mutated topology examples), the happy path, as well as input validation.
Do I need to create documentation before this can go out or can I handle it in a subsequent PR?
PR Checklist: