tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Add tree_pos to tsk_tree_t struct and change next, prev and seek_from_null to use tree_pos #2874

Closed duncanMR closed 4 months ago

duncanMR commented 7 months ago

Description

This PR aims to add tree_pos to the tsk_tree_t struct and alter next, prev, first, last and seek_from_null to use the tree positioning code. This means that we can remove the separate tsk_tree_position iterator in LSHMM and KC distance code.

Fixes #2794, Fixes #2795 and Fixes #2825.

PR Checklist:

duncanMR commented 7 months ago

@jeromekelleher Unfortunately the CI issues remain on the fresh branch. Is there any way to force update the package cache?

codecov[bot] commented 7 months ago

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (813e361) 89.75% compared to head (a614344) 89.79%.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874/graphs/tree.svg?width=650&height=150&src=pr&token=7RoAMA56V2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) ```diff @@ Coverage Diff @@ ## main #2874 +/- ## ========================================== + Coverage 89.75% 89.79% +0.03% ========================================== Files 30 30 Lines 30395 30399 +4 Branches 5912 5909 -3 ========================================== + Hits 27282 27296 +14 + Misses 1781 1778 -3 + Partials 1332 1325 -7 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.21% <93.37%> (+0.07%)` | :arrow_up: | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `80.78% <ø> (ø)` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `67.71% <ø> (ø)` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `98.96% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c/tskit/haplotype\_matching.c](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-Yy90c2tpdC9oYXBsb3R5cGVfbWF0Y2hpbmcuYw==) | `84.55% <100.00%> (+0.07%)` | :arrow_up: | | [c/tskit/trees.c](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#diff-Yy90c2tpdC90cmVlcy5j) | `90.42% <93.18%> (+0.20%)` | :arrow_up: | ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Last update [813e361...a614344](https://app.codecov.io/gh/tskit-dev/tskit/pull/2874?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev).
jeromekelleher commented 7 months ago

@benjeffery what'st the easiest way of dealing with this?

duncanMR commented 7 months ago

Ben suggested bumping the version number in the cache config, which I did in the last commit, but for some reason it seems to have not refreshed the cache.

benjeffery commented 7 months ago

Bumping the number will certainly have flushed the cache. Think there is a deeper issue here - checking it now.

benjeffery commented 7 months ago

BTW - The CI cache works across PRs, so a fresh one doesn't have any effect.

Hoping https://github.com/tskit-dev/tskit/pull/2875 fixes this CI issue.

duncanMR commented 7 months ago

Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue

benjeffery commented 7 months ago

Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue

If you need to rebase to main and the PR is other wise complete you can use "mergifyio rebase" replacing the with a @.

duncanMR commented 7 months ago

I think I've covered all the cases where tsk_tree_position_t was used outside of tsk_tree_t. I haven't figured out how to resolve the codecov warning: the tests implicitly check if tsk_tree_position_init returns a nonzero value within tsk_tree_init, and I've added checks that all the initialised values are correct.

jeromekelleher commented 7 months ago

I would drop the last two commits for now @duncanMR, and focus on changing tree methods to use the tree_pos

duncanMR commented 7 months ago

I've moved the last two commits into another branch for now. I'm going to have to pause this for now, but I'll continue revising the tree positioning code next year.

duncanMR commented 5 months ago

I've moved over all the tree positioning functions to use tree_pos; for now, I've made seek_from_null just call tsk_tree_next repeatedly for now, until we implement seek_forward. All tests are passing except for 21 in the class tests/test_highlevel.py::TestSeekDirection. I've narrowed down the cause: the left_child and right_sib arrays are calculated incorrectly when calling prev from the null state:

from tests import tsutil
from tests import test_tree_positioning
ts = tsutil.all_trees_ts(3)
tree_state = test_tree_positioning.StatefulTree(ts)
tree_state.prev()
print(tree_state)
parent: [3, 3, 4, 4, -1]
position:
    index: 3
    interval: Interval(left=3.0, right=4.0)
    direction: -1
    in_range: EdgeRange(start=8, stop=4, order=array([0, 5, 2, 6, 4, 8, 7, 3, 1], dtype=int32))
    out_range: EdgeRange(start=8, stop=8, order=array([0, 2, 4, 5, 8, 1, 6, 3, 7], dtype=int32))

So we will insert edges 1 ($3 \rightarrow 0$), 3 ($3 \rightarrow 1$), 7 ($4 \rightarrow 2$) and 8 ($4 \rightarrow 3$) in that order. Because edge 7 is inserted first, we set left_child(4) = 2 and right_sib[2] = 3. This is incorrect, because it is supposed to be left_child(4) = 3 and right_sib(3)=2, as we can see from the tree plot:

image

I'm still trying to figure out how the system for node ordering works. I think we either need to change the in_range calculation, or we have to revise how left_child and right_sib are updated, but I'm not sure yet.

jeromekelleher commented 5 months ago

Hmm, the ordering of children for a parent is documented as arbitrary, and we can change it. If that's the only problem then I wouldn't consider it a bug.

Is it just those tests expect a particular ordering?

duncanMR commented 5 months ago

Yes, it seems that test_highlevel.py::TestSeekDirection is the only test class that checks the child and sib arrays directly with the assert_trees_identical function. The tests fail because the child and sib arrays are calculated differently when reaching the last tree in a sequence with prev or next. I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

jeromekelleher commented 5 months ago

I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different left_sib arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.

So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.

duncanMR commented 5 months ago

I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?

It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different left_sib arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.

So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.

I see. Yes, the right_sib and left_child have just been permuted at one position; here, Tree 2 is what the current library code outputs and Tree 1 is the result of calling prev from the null tree, which used to give the same result prior to my changes:

Tree 1 left_child: [-1 -1 -1  0  2  4]
Tree 2 left_child: [-1 -1 -1  0  3  4]
Tree 1 left_sib:   [-1  0 -1  2 -1 -1]
Tree 2 left_sib:   [-1  0  3 -1 -1 -1]

The trees are identical otherwise. Should I just remove the checks for whether the left/right child/sib arrays are identical for this test?

jeromekelleher commented 5 months ago

Yes for now can just comment out those tests so we can see how the rest of the suite is doing (would come up with a better solution before merging)

duncanMR commented 5 months ago

I've got seek_from_null working using seek_forward and seek_backward. I just need to add some tests now and check that the performance hasn't regressed.

duncanMR commented 4 months ago

I had some trouble replicating Kevin Thornton's performance plots for seek_from_null, since my timing data for each seek has higher variance and I don't have his benchmarking code to compare with. Strangely, the times I get for the new tree_pos implementation have tended to have higher variance; sometimes, a single seek takes an order of magnitude longer than normal. Overall, it seems like the new implementation is comparable in performance to the old method, but slightly slower on average due to the occasionally long seek times:

import stdpopsim
import tskit
import time
import pandas as pd

species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("OutOfAfrica_3G09")
contig = species.get_contig("chr1", length_multiplier=0.2)
samples = {"YRI": 0, "CHB": 0, "CEU": 2e4}
engine = stdpopsim.get_engine("msprime")
test_ts = engine.simulate(model, contig, samples, seed=1)

def seek_from_null_sweep(ts, method):
    data = []

    for n in range(0, ts.num_trees - 1, 500):
        tree = tskit.Tree(ts)
        before = time.perf_counter()
        tree.seek(n)
        duration = time.perf_counter() - before
        data.append([n, duration, method])

    # Convert the list to a DataFrame
    df = pd.DataFrame(data, columns=['Index', 'Time', 'Method'])
    return df

#lib_null_df = seek_from_null_sweep(test_ts, method="lib")
tree_pos_null_df = seek_from_null_sweep(test_ts, method="tree_pos")

image

We can't determine a clear difference between the methods from a single run, but if I repeatedly time how long it takes to seek from null to various points along the genome, the difference between the two implementations does seem significant in some cases. Here, I seeked to 1000 indices in the HGDP and msprime simulated trees, and only 10 indices in the sc2ts trees. This how I timed it:

def test_seek_from_null(ts, num_runs, num_indices):
    times = np.zeros(num_runs)
    tree = tskit.Tree(ts)
    indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
    for run in range(num_runs):
        before = time.perf_counter()
        for n in indices:
            tree.clear()
            tree.seek(n)
        times[run] = time.perf_counter() - before
    return np.mean(times), np.std(times)

Here are the results, including the tests for other tree positioning methods (the times are averaged over 30 runs for seek_from_null and 50 runs for the other tests):

image

@jeromekelleher Do you think this warrants further investigation / profiling? My test setup is far from ideal: background processes and the complexity of WSL + Jupyter + Windows causes significant variation in the timings I calculate. In my first experiments, the performance of the new method was actually worse, but I couldn't replicate that later on.

molpopgen commented 4 months ago

I don't totally remember the details but my benchmarks were probably all done using a prototype written using tskit-rust and are now lost to history.

jeromekelleher commented 4 months ago

Interesting - what's the percentage difference in the average seek_from_null times? I guess we should run it through perf or something to see where the time is being spent in the low-level code.

duncanMR commented 4 months ago

I wrote a better performance test that calculates mean seek times for each of a subset of indices

def test_seek_from_null(ts, ts_name, num_runs, num_indices, method):
    times_matrix = np.zeros((num_runs, num_indices))
    indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)

    for run in range(num_runs):
        for i, n in enumerate(indices):
            tree = tskit.Tree(ts)
            before = time.perf_counter()
            tree.seek(n)
            after = time.perf_counter()
            times_matrix[run, i] = (after - before) * 1000000

    mean_times = np.mean(times_matrix, axis=0)
    std_times = np.std(times_matrix, axis=0)

    # Create a DataFrame to return
    results_df = pd.DataFrame({
        'index': indices,
        'mean': mean_times,
        'sd': std_times,
        'method': method,
        'ts': ts_name
    })

    return results_df

I computed the seek times for three TS, starting with a huge simulated ts:

image

The new algorithm performs substantially better toward the end of the sequence (results are averaged over 200 runs)

image

The variance in seek times is extremely high in the HGDP case for both methods, but the implementations perform within measurement error for each other

image

I had to reduce the number of runs to 50 for sc2ts because it takes so long. It's another clear win for the new implementation

image

I've updated the LSHMM and KC distance code. I removed the tsk_tree_position_step function because it is actually useless (we never want to update tsk_tree_position separately to the tree now). I think it's ready for review now. I just need to rebase

duncanMR commented 4 months ago

@Mergifyio rebase

mergify[bot] commented 4 months ago

rebase

❌ Unable to rebase: user duncanMR is unknown.

Please make sure `duncanMR` has logged in Mergify dashboard.
duncanMR commented 4 months ago

Fantastic. I've gone through this carefully and just spotted one detail. Ready to merge after that's fixed.

Can you keep the scripts etc that you use here handy, as it would be nice to do a tskit .dev news item with these perf numbers when we release this. Once you've implemented the seek_forward/backward functionality I think we'll want to do a joint Python and C API release, as the perf improvements are significant.

Thanks for the feedback; I've corrected both issues and rebased. I'll definitely keep my scripts handy; I'll be using them to assess the new seek functionality soon.