Closed duncanMR closed 4 months ago
@jeromekelleher Unfortunately the CI issues remain on the fresh branch. Is there any way to force update the package cache?
Attention: 12 lines
in your changes are missing coverage. Please review.
Comparison is base (
813e361
) 89.75% compared to head (a614344
) 89.79%.
@benjeffery what'st the easiest way of dealing with this?
Ben suggested bumping the version number in the cache config, which I did in the last commit, but for some reason it seems to have not refreshed the cache.
Bumping the number will certainly have flushed the cache. Think there is a deeper issue here - checking it now.
BTW - The CI cache works across PRs, so a fresh one doesn't have any effect.
Hoping https://github.com/tskit-dev/tskit/pull/2875 fixes this CI issue.
Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue
Ah, I realised that using Sync Fork on GitHub causes a merge. Will not use it again. Hopefully CI works now, just need to fix the coverage issue
If you need to rebase to main and the PR is other wise complete you can use "mergifyio rebase" replacing the with a @.
I think I've covered all the cases where tsk_tree_position_t
was used outside of tsk_tree_t
. I haven't figured out how to resolve the codecov warning: the tests implicitly check if tsk_tree_position_init
returns a nonzero value within tsk_tree_init
, and I've added checks that all the initialised values are correct.
I would drop the last two commits for now @duncanMR, and focus on changing tree methods to use the tree_pos
I've moved the last two commits into another branch for now. I'm going to have to pause this for now, but I'll continue revising the tree positioning code next year.
I've moved over all the tree positioning functions to use tree_pos
; for now, I've made seek_from_null
just call tsk_tree_next
repeatedly for now, until we implement seek_forward
. All tests are passing except for 21 in the class tests/test_highlevel.py::TestSeekDirection
. I've narrowed down the cause: the left_child
and right_sib
arrays are calculated incorrectly when calling prev
from the null state:
from tests import tsutil
from tests import test_tree_positioning
ts = tsutil.all_trees_ts(3)
tree_state = test_tree_positioning.StatefulTree(ts)
tree_state.prev()
print(tree_state)
parent: [3, 3, 4, 4, -1]
position:
index: 3
interval: Interval(left=3.0, right=4.0)
direction: -1
in_range: EdgeRange(start=8, stop=4, order=array([0, 5, 2, 6, 4, 8, 7, 3, 1], dtype=int32))
out_range: EdgeRange(start=8, stop=8, order=array([0, 2, 4, 5, 8, 1, 6, 3, 7], dtype=int32))
So we will insert edges 1 ($3 \rightarrow 0$), 3 ($3 \rightarrow 1$), 7 ($4 \rightarrow 2$) and 8 ($4 \rightarrow 3$) in that order.
Because edge 7 is inserted first, we set left_child(4) = 2
and right_sib[2] = 3
. This is incorrect, because it is supposed to be left_child(4) = 3
and right_sib(3)=2
, as we can see from the tree plot:
I'm still trying to figure out how the system for node ordering works. I think we either need to change the in_range
calculation, or we have to revise how left_child
and right_sib
are updated, but I'm not sure yet.
Hmm, the ordering of children for a parent is documented as arbitrary, and we can change it. If that's the only problem then I wouldn't consider it a bug.
Is it just those tests expect a particular ordering?
Yes, it seems that test_highlevel.py::TestSeekDirection
is the only test class that checks the child and sib arrays directly with the assert_trees_identical
function. The tests fail because the child and sib arrays are calculated differently when reaching the last tree in a sequence with prev
or next
. I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?
I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?
It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different left_sib
arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.
So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.
I get that the ordering is arbitrary, but I would have thought we'd need it to be independent of the tree positioning direction?
It's always been conditioned on the precise order in which edges are inserted and removed. For example, you will slightly different
left_sib
arrays when you seek to the same tree in the left and right directions. There's no avoiding this if we want to keep it efficient for large polytomies.So, just to clarify, the arrays are correct here, but they are ordered slightly differently to the currently library code? I'm OK with that, if that's the case.
I see. Yes, the right_sib
and left_child
have just been permuted at one position; here, Tree 2 is what the current library code outputs and Tree 1 is the result of calling prev
from the null tree, which used to give the same result prior to my changes:
Tree 1 left_child: [-1 -1 -1 0 2 4]
Tree 2 left_child: [-1 -1 -1 0 3 4]
Tree 1 left_sib: [-1 0 -1 2 -1 -1]
Tree 2 left_sib: [-1 0 3 -1 -1 -1]
The trees are identical otherwise. Should I just remove the checks for whether the left/right child/sib arrays are identical for this test?
Yes for now can just comment out those tests so we can see how the rest of the suite is doing (would come up with a better solution before merging)
I've got seek_from_null
working using seek_forward
and seek_backward
. I just need to add some tests now and check that the performance hasn't regressed.
I had some trouble replicating Kevin Thornton's performance plots for seek_from_null
, since my timing data for each seek has higher variance and I don't have his benchmarking code to compare with. Strangely, the times I get for the new tree_pos
implementation have tended to have higher variance; sometimes, a single seek takes an order of magnitude longer than normal. Overall, it seems like the new implementation is comparable in performance to the old method, but slightly slower on average due to the occasionally long seek times:
import stdpopsim
import tskit
import time
import pandas as pd
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("OutOfAfrica_3G09")
contig = species.get_contig("chr1", length_multiplier=0.2)
samples = {"YRI": 0, "CHB": 0, "CEU": 2e4}
engine = stdpopsim.get_engine("msprime")
test_ts = engine.simulate(model, contig, samples, seed=1)
def seek_from_null_sweep(ts, method):
data = []
for n in range(0, ts.num_trees - 1, 500):
tree = tskit.Tree(ts)
before = time.perf_counter()
tree.seek(n)
duration = time.perf_counter() - before
data.append([n, duration, method])
# Convert the list to a DataFrame
df = pd.DataFrame(data, columns=['Index', 'Time', 'Method'])
return df
#lib_null_df = seek_from_null_sweep(test_ts, method="lib")
tree_pos_null_df = seek_from_null_sweep(test_ts, method="tree_pos")
We can't determine a clear difference between the methods from a single run, but if I repeatedly time how long it takes to seek from null to various points along the genome, the difference between the two implementations does seem significant in some cases. Here, I seeked to 1000 indices in the HGDP and msprime simulated trees, and only 10 indices in the sc2ts trees. This how I timed it:
def test_seek_from_null(ts, num_runs, num_indices):
times = np.zeros(num_runs)
tree = tskit.Tree(ts)
indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
for run in range(num_runs):
before = time.perf_counter()
for n in indices:
tree.clear()
tree.seek(n)
times[run] = time.perf_counter() - before
return np.mean(times), np.std(times)
Here are the results, including the tests for other tree positioning methods (the times are averaged over 30 runs for seek_from_null
and 50 runs for the other tests):
@jeromekelleher Do you think this warrants further investigation / profiling? My test setup is far from ideal: background processes and the complexity of WSL + Jupyter + Windows causes significant variation in the timings I calculate. In my first experiments, the performance of the new method was actually worse, but I couldn't replicate that later on.
I don't totally remember the details but my benchmarks were probably all done using a prototype written using tskit-rust and are now lost to history.
Interesting - what's the percentage difference in the average seek_from_null
times? I guess we should run it through perf
or something to see where the time is being spent in the low-level code.
I wrote a better performance test that calculates mean seek times for each of a subset of indices
def test_seek_from_null(ts, ts_name, num_runs, num_indices, method):
times_matrix = np.zeros((num_runs, num_indices))
indices = np.linspace(0, ts.num_trees - 1, num_indices, dtype=int)
for run in range(num_runs):
for i, n in enumerate(indices):
tree = tskit.Tree(ts)
before = time.perf_counter()
tree.seek(n)
after = time.perf_counter()
times_matrix[run, i] = (after - before) * 1000000
mean_times = np.mean(times_matrix, axis=0)
std_times = np.std(times_matrix, axis=0)
# Create a DataFrame to return
results_df = pd.DataFrame({
'index': indices,
'mean': mean_times,
'sd': std_times,
'method': method,
'ts': ts_name
})
return results_df
I computed the seek times for three TS, starting with a huge simulated ts:
The new algorithm performs substantially better toward the end of the sequence (results are averaged over 200 runs)
The variance in seek times is extremely high in the HGDP case for both methods, but the implementations perform within measurement error for each other
I had to reduce the number of runs to 50 for sc2ts because it takes so long. It's another clear win for the new implementation
I've updated the LSHMM and KC distance code. I removed the tsk_tree_position_step
function because it is actually useless (we never want to update tsk_tree_position
separately to the tree now). I think it's ready for review now. I just need to rebase
@Mergifyio rebase
rebase
duncanMR
is unknown.Fantastic. I've gone through this carefully and just spotted one detail. Ready to merge after that's fixed.
Can you keep the scripts etc that you use here handy, as it would be nice to do a tskit .dev news item with these perf numbers when we release this. Once you've implemented the seek_forward/backward functionality I think we'll want to do a joint Python and C API release, as the perf improvements are significant.
Thanks for the feedback; I've corrected both issues and rebased. I'll definitely keep my scripts handy; I'll be using them to assess the new seek
functionality soon.
Description
This PR aims to add
tree_pos
to thetsk_tree_t
struct and alternext
,prev
,first
,last
andseek_from_null
to use the tree positioning code. This means that we can remove the separatetsk_tree_position
iterator in LSHMM and KC distance code.Fixes #2794, Fixes #2795 and Fixes #2825.
PR Checklist: