API for counting coalescing pairs

nspope commented 3 months ago

Core algorithm, tests and API that partially address #2904

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.61%. Comparing base (d1f81e2) to head (f7078da).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #2915      +/-   ##
==========================================
- Coverage   89.68%   89.61%   -0.07%
==========================================
  Files          29       29
  Lines       30391    30176     -215
  Branches     5907     5874      -33
==========================================
- Hits        27255    27043     -212
  Misses       1793     1793
+ Partials     1343     1340       -3
```

| [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | |
|---|---|---|
| [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.21% <ø> (ø)` | |
| [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | |
| [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `88.72% <ø> (ø)` | |
| [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `98.97% <100.00%> (+0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | |
|---|---|---|
| [python/tskit/stats.py](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915?src=pr&el=tree&filepath=python%2Ftskit%2Fstats.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-cHl0aG9uL3Rza2l0L3N0YXRzLnB5) | `100.00% <ø> (+0.77%)` | :arrow_up: |
| [python/tskit/trees.py](https://app.codecov.io/gh/tskit-dev/tskit/pull/2915?src=pr&el=tree&filepath=python%2Ftskit%2Ftrees.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-cHl0aG9uL3Rza2l0L3RyZWVzLnB5) | `98.84% <100.00%> (+0.07%)` | :arrow_up: |

nspope commented 3 months ago

@jeromekelleher and @petrelharp this is ready for a look whenever you have time.

Shall I move the core algorithm into the C library in this PR or in a followup?

petrelharp commented 3 months ago

The remainder method is very clever. However, it does involve a lot of adding & subtracting of close numbers: is floating point error liable to be an issue? If there are n samples and a sequence length L, then a node present for O(1) sequence length will have its coalescing pairs computed as n^2 * (L + O(1)) - n^2 * L; I suppose this is only a problem if 1 / L is machine epsilon? So, not a worry?

petrelharp commented 3 months ago

This looks great - I've made some comments; and perhaps it needs another test case for which some samples might be parents to other samples? (e.g., produced by the nonWF simulator in the test suite, I think?)

nspope commented 3 months ago

> I suppose this is only a problem if 1 / L is machine epsilon

Right -- if you've got a node with span S in a sequence of length L, you'll lose bits if S / L < machine epsilon. In practice it seems that this won't happen, and could be detected easily enough. What do you think, @jeromekelleher?
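For reference, a minimal sketch of the cancellation being discussed, assuming double-precision floats. The helper `span_by_difference` is hypothetical and is not the code in this PR; it only mimics recovering a span S as the difference of two totals that both scale with L.

```python
import sys

# Machine epsilon for float64 (about 2.2e-16).
EPS = sys.float_info.epsilon


def span_by_difference(L, S):
    """Recover S as (L + S) - L, the pattern that risks cancellation."""
    return (L + S) - L


# S / L = 1e-8 is far above machine epsilon, so S is recovered exactly.
recovered_ok = span_by_difference(1e8, 1.0)

# S / L = 1e-17 is below machine epsilon, so L + S rounds back to L
# and the span is lost entirely.
recovered_lost = span_by_difference(1e17, 1.0)

print(EPS, recovered_ok, recovered_lost)
```

With realistic sequence lengths the ratio S / L stays many orders of magnitude above epsilon, which matches the expectation that this won't occur in practice.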

jeromekelleher commented 3 months ago

Seems pretty unlikely to me

petrelharp commented 3 months ago

We discussed changing the name, but I've forgotten to what? And, to be clear, I think Jerome's suggesting modifying this python algorithm, not adding a new one.

nspope commented 3 months ago

This is ready for another look ...

- Addressed @petrelharp's comments -- algorithm had to be modified slightly to work when internal nodes are flagged as samples.
- Changed name to 'pair_coalescence_counts'. The idea is this'll have a twin, 'pair_coalescence_rates' that invokes 'pair_coalescence_counts' under the hood and converts counts to rates within time windows or quantile intervals.
- Added a 'time_discretisation' argument that maps nodes to time intervals. When there's a ton of nodes / windows / sample sets the output array may get too big to store in memory. In which case this lets one get a more compact summary that can be used to e.g. calculate coalescence rates. By default this is a string "nodes"; otherwise an array of breakpoints ending at np.inf.
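To illustrate the compaction idea, here is a rough sketch (not the PR's implementation) of summing per-node counts into time intervals defined by breakpoints ending at `np.inf`. The function name `bin_node_counts`, its arguments, and the half-open interval convention `[breaks[i], breaks[i+1])` are all assumptions made for the example.

```python
import numpy as np


def bin_node_counts(node_times, node_counts, breaks):
    """Sum per-node counts into intervals [breaks[i], breaks[i+1])."""
    node_times = np.asarray(node_times, dtype=float)
    node_counts = np.asarray(node_counts, dtype=float)
    breaks = np.asarray(breaks, dtype=float)
    # Index of the interval containing each node time.
    idx = np.searchsorted(breaks, node_times, side="right") - 1
    inside = (idx >= 0) & (idx < breaks.size - 1)
    binned = np.zeros(breaks.size - 1)
    # Unbuffered add so repeated interval indices accumulate correctly.
    np.add.at(binned, idx[inside], node_counts[inside])
    return binned


# Four nodes collapse into three intervals; the last is open-ended.
binned = bin_node_counts(
    node_times=[0.5, 1.5, 2.5, 10.0],
    node_counts=[1, 2, 3, 4],
    breaks=[0.0, 1.0, 2.0, np.inf],
)
print(binned)
```

The output has one entry per interval rather than one per node, which is where the memory saving would come from when there are many nodes.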

nspope commented 3 months ago

> I think Jerome's suggesting modifying this python algorithm, not adding a new one.

But in a followup PR, right?

jeromekelleher commented 3 months ago

Any idea why codecov isn't working here @benjeffery? It's making the diff impossible to read here, which is worse than useless.

nspope commented 3 months ago

> If you like you can drop this argument from the initial version and make an issue to track?

Thanks @jeromekelleher, it'd be great to sort this API out now as it's very close.

re: naming, @petrelharp brought up that the stats methods will eventually have a time_windows argument, but that the behavior will be a bit different because pair_coalescence_counts needs a (default) time discretisation option of "nodes" as well as an option to pass time window breakpoints. So we could either:

- Name this argument something other than time_windows -- I think time_bins is fine (definitely better than time_discretisation)
- Call this argument time_windows and document that it accepts an additional string option that won't be accepted by other stats methods

Either is fine by me. Do you have a preference @petrelharp?

jeromekelleher commented 3 months ago

I think having this method accept an additional argument to time_windows here would be totally fine. It would be more confusing to use a different name just to avoid that minor bit of inconsistency I think.

benjeffery commented 3 months ago

> Any idea why codecov isn't working here @benjeffery? It's making the diff impossible to read here, which is worse than useless.

I'm seeing a "Commit YAML is invalid" error at codecov - checking it out.

benjeffery commented 3 months ago

@mergifyio rebase

mergify[bot] commented 3 months ago

rebase

✅ Branch has been successfully rebased

benjeffery commented 3 months ago

@nspope I've rebased here to fix CI so you won't be able to push changes without a reset. Not sure how much of a git ninja you are so let me know if you need to push changes.

petrelharp commented 3 months ago

> I think having this method accept an additional argument to time_windows here would be totally fine. It would be more confusing to use a different name just to avoid that minor bit of inconsistency I think.

Sounds good to me!

nspope commented 3 months ago

Thanks @benjeffery! Looks like codecov is failing lwt-tests and python-c-tests -- I don't think that's due to this PR?

nspope commented 3 months ago

Thanks @jeromekelleher -- I've modified so that time_windows argument accepts either "nodes" or a sorted array of breakpoints. Nodes that fall outside of time_windows aren't counted in the output.
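As a rough illustration of that behaviour (a sketch only, not the code in this PR), an argument that accepts either the string "nodes" or sorted breakpoints could be resolved into a per-node output index along these lines. The helper `resolve_time_windows`, the use of `-1` to mark nodes that are not counted, and the half-open interval convention are assumptions for the example.

```python
import numpy as np


def resolve_time_windows(node_times, time_windows="nodes"):
    """Map each node to an output index, or -1 if it is not counted."""
    node_times = np.asarray(node_times, dtype=float)
    if isinstance(time_windows, str):
        if time_windows != "nodes":
            raise ValueError("the only accepted string is 'nodes'")
        # One output slot per node.
        return np.arange(node_times.size)
    breaks = np.asarray(time_windows, dtype=float)
    if breaks.ndim != 1 or breaks.size < 2 or np.any(np.diff(breaks) <= 0):
        raise ValueError("breakpoints must be a sorted 1-D array")
    # Window i is taken to cover [breaks[i], breaks[i+1]).
    idx = np.searchsorted(breaks, node_times, side="right") - 1
    idx[(idx < 0) | (idx >= breaks.size - 1)] = -1
    return idx


per_node = resolve_time_windows([0.0, 1.0, 5.0])
windowed = resolve_time_windows([0.5, 1.0, 3.0, 4.0, 9.0], [1.0, 2.0, 4.0])
print(per_node, windowed)
```

Here the nodes at times 0.5, 4.0 and 9.0 fall outside the windows and are marked as dropped, while passing "nodes" keeps every node in its own slot.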

nspope commented 3 months ago

Darn, it looks like codecov is still mangling the diff, though this is rebased onto 9e1bf0 ? Sorry Ben, maybe rebasing / pushing on my end screwed something up?

benjeffery commented 2 months ago

So the comment in https://github.com/tskit-dev/tskit/pull/2915#issuecomment-2029229828 is correct - only one line is missing coverage. However, previous comments made by codecov are not removed :( I think the only way to clear those would be to open a new PR.

nspope commented 2 months ago

Hmm, now the post-test Codecov upload is erroring out:

[2024-04-10T04:01:34.167Z] ['verbose'] The error stack is: Error: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 404 - {'detail': ErrorDetail(string='Unable to locate build via Github Actions API. Please upload with the Codecov repository upload token to resolve issue.', code='not_found')}

I'll go ahead and open a new PR ...

benjeffery commented 2 months ago

> Hmm, now the post-test Codecov upload is erroring out:
>
> [2024-04-10T04:01:34.167Z] ['verbose'] The error stack is: Error: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 404 - {'detail': ErrorDetail(string='Unable to locate build via Github Actions API. Please upload with the Codecov repository upload token to resolve issue.', code='not_found')}
>
> I'll go ahead and open a new PR ...

Yeah, we sometimes get that one. I spent a while investigating it and it seems to be transient on codecov's end.

nspope commented 2 months ago

@benjeffery it may be related to v3 codecov-actions deprecation, see comment. Happy to wait and see if it resolves itself though

nspope commented 2 months ago

Well -- bumping codecov-actions to v4 got rid of the upload errors, but the coverage reports don't seem to be making their way here.

nspope commented 2 months ago

@benjeffery Sorry for the hassle ... but any chance you could clone and force-push this to re-trigger codecov? To see if the issue might be related to this PR coming from a fork? It seems like the coverage report is working fine in #2924

benjeffery commented 2 months ago

@mergifyio rebase

mergify[bot] commented 2 months ago

rebase

✅ Branch has been successfully rebased

benjeffery commented 2 months ago

Codecov issue now seems fixed here - it has removed all its comments about uncovered lines.

jeromekelleher commented 2 months ago

Are we good to merge then @nspope?

nspope commented 2 months ago

Yes! Thanks for the fix @benjeffery
