tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Implement efficient seeking from non-null trees using tree_pos #2911

Open duncanMR opened 7 months ago

duncanMR commented 7 months ago

Description

Continuing from #2874, we want to finish moving over the tree-positioning code to use tree_pos efficiently. At the moment, tsk.tree_seek will either call tsk_tree_seek_from_null or tsk_tree_seek_linear depending on whether we are starting from the null tree or not. seek_linear repeatedly calls next or prev until it reaches the given position, with the direction being determined by which would cover the shortest distance.

As a first pass, I've implemented tsk_tree_seek_forward and tsk_tree_seek_backward and I've incorporated them into tsk_tree_seek_linear. We will need to revise some of the test_highlevel.py seek tests, because the direction we choose to seek along is different to the old approach in some cases. For example, we now seek forward to go from the first to the last tree in a sequence.

Curiously, my implementation passes all the C tests with no memory issues detected by Valgrind, and it also passes all the test_highlevel.py and test_tree_positioning tests except for the ones dependent on seeking direction. However, it has caused chaos with other Python tests, causing failures and segfaults in test_stats.py and test_divmat.py among others. The failing/crashing tests seem to be primarily be associated with LD calculations and divergence.

I'm currently trying to determine whether the problems are due to an error in my implemention (most likely) or the subtle problems with the time ordering of inserted edges, discussed in #2792.

PR Checklist:

codecov[bot] commented 7 months ago

Codecov Report

Attention: Patch coverage is 9.21053% with 69 lines in your changes are missing coverage. Please review.

Project coverage is 89.59%. Comparing base (8b1be4f) to head (35dcda4).

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911/graphs/tree.svg?width=650&height=150&src=pr&token=7RoAMA56V2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) ```diff @@ Coverage Diff @@ ## main #2911 +/- ## ========================================== - Coverage 89.79% 89.59% -0.21% ========================================== Files 30 30 Lines 30399 30472 +73 Branches 5909 5922 +13 ========================================== + Hits 27296 27300 +4 - Misses 1778 1846 +68 - Partials 1325 1326 +1 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `85.79% <2.81%> (-0.42%)` | :arrow_down: | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `67.71% <100.00%> (+<0.01%)` | :arrow_up: | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `98.96% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [python/\_tskitmodule.c](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-cHl0aG9uL190c2tpdG1vZHVsZS5j) | `88.73% <100.00%> (+<0.01%)` | :arrow_up: | | [c/tskit/trees.c](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-Yy90c2tpdC90cmVlcy5j) | `89.05% <2.81%> (-1.38%)` | :arrow_down: | ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Last update [8b1be4f...35dcda4](https://app.codecov.io/gh/tskit-dev/tskit/pull/2911?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev).
jeromekelleher commented 7 months ago

That is interesting. I can see the LD calculator causing problems as that does a lot of seeking backward and forward (was the original motivation for bidirectional seeking). I don't see anything obvious wrong, but it must be something to do with sample counts.

One thing that we can at least make progress on is that I think we should keep the option of seeking linearly. So, we make a new option to seek, TSK_SEEK_LINEAR which always uses the linear seek algorithm. We'll want to expose this to Python also, so there'll be a bit of plumbing involved.

We should specify this option in the ld_calculator, as that is definitely somewhere that linear seeking makes sense.

duncanMR commented 7 months ago

That is interesting. I can see the LD calculator causing problems as that does a lot of seeking backward and forward (was the original motivation for bidirectional seeking). I don't see anything obvious wrong, but it must be something to do with sample counts.

One thing that we can at least make progress on is that I think we should keep the option of seeking linearly. So, we make a new option to seek, TSK_SEEK_LINEAR which always uses the linear seek algorithm. We'll want to expose this to Python also, so there'll be a bit of plumbing involved.

We should specify this option in the ld_calculator, as that is definitely somewhere that linear seeking makes sense.

Okay; that makes sense to me. I'll revert seek_linear to its previous form and move the seek algorithm that uses tree_pos to a new function. I'll make the tree_pos method the default for seek, then see exactly which methods need to be changed back to linear in order for the tests to pass. Shall we call the new tree_pos method tsk_tree_seek_nonlinear?

jeromekelleher commented 7 months ago

seek_skip might be a bit more descriptive?