Open a-ignatieva opened 2 years ago
Hi @a-ignatieva,
Thanks for the suggestion and example code. I believe the first function in your code (get_num_lineages
) is being discussed at https://github.com/tskit-dev/tskit/issues/386, with reference to using it for the Relate statistic, so this is certainly something that there is motivation to do.
Maybe I've missed something, but it looks like your function counts the number of nodes later than a node rather than the lineages?
Ah great, thanks @benjeffery - I figured this probably wasn't a new thought!
I assumed there are no unary nodes and a binary tree, so just seemed like a shortcut to count nodes that are older than the given node (as each one creates one additional lineage). That works right?
I thought it'd best to go by nodes rather than times (the Relate test at least needs the number of lineages just above a node) - and something like this should work with polytomies I think:
def get_num_lineages_in_tree(tree):
results = {}
count = 1
for node in tree.nodes(order="timedesc"):
if tree.is_sample(node):
results[node] = tree.num_samples()
else:
results[node] = count
count += tree.num_children(node) - 1
return results
This is great, thanks @a-ignatieva. To get an idea of how to structure this so we can compute it as efficiently as we can, would we usually want to do this for all mutations? So, we'd have a TreeSequence method like
def selection_test_pvalue(self):
""""
Return an array of P-values under the Relate selection test for each mutation in the tree sequence.
"""
or would we usually hand-pick specific mutations?
I assumed there are no unary nodes and a binary tree, so just seemed like a shortcut to count nodes that are older than the given node (as each one creates one additional lineage). That works right?
Clever! Almost all tree sequences from (for example) tsinfer
will not have those properties, which is making wonder if there is value in adding Tree.has_unary_nodes()
and Tree.is_binary()
to the API.
To get an idea of how to structure this so we can compute it as efficiently as we can, would we usually want to do this for all mutations? or would we usually hand-pick specific mutations?
@jeromekelleher I think definitely both, ideally! I'm looking to do the former, but I can also imagine wanting to do the latter for specific mutations.
Almost all tree sequences from (for example) tsinfer will not have those properties, which is making wonder if there is value in adding Tree.has_unary_nodes() and Tree.is_binary() to the API.
@benjeffery I think this would be really great to have!
I assumed there are no unary nodes and a binary tree, so just seemed like a shortcut to count nodes that are older than the given node (as each one creates one additional lineage). That works right?
Clever! Almost all tree sequences from (for example)
tsinfer
will not have those properties, ...
I don't think we should assume binary trees here for the statistic. I guess it would not add much to the efficiency to check on number of children of a node in the tree, though.
I don't think we should assume binary trees here for the statistic. I guess it would not add much to the efficiency to check on number of children of a node in the tree, though.
Agreed - this would make it basically useless for tsinfer trees.
Picking this one up again, here's a proposal for what a single mutation implementation would look like (not that we have the Tree.num_lineages
method:
def relate_test_pvalue(self, mutation_id):
mut = self.mutation(mut)
site = self.site(mut.site)
tree = self.at(site.position)
time = self.node(mut.mutations).time
N = self.num_samples
k = 1 + tree.num_lineages(time)
fn = tree.num_samples(mut.node)
p = 0.0
for f in range(fn, N - k + 3):
p += (f - 1) * math.comb(N - f - 1, k - 3) / math.comb(N - 1, k - 1)
return p
Is this correct when we have polytomies?
It would be good to get this in to the library, and I'm happy to push through the coding if we can agree on an concrete definition to implement.
I think worth throwing an error or something if there's more than one mutation at the site? Eg if a reversion happens somewhere in the subtree below (or a homoplasy somewhere else) then this test doesn't work conceptually. Likewise if the mutation is a singleton (the test conditions on frequency >=2).
I think it's totally unclear to me how to do this properly with polytomies. Feels like it depends on what a polytomy represents - suppose just uncertainty in split times (rather than actual multiple mergers or something): then in a binary tree the number of lineages when a mutations gets to frequency 2 is well defined - but in a tree with polytomies taking k as [number of lineages just above the node + 1] will generally be an under- or over-estimate of the actual number of lineages that were there at this split point, right? Eg in this crude drawing, you're trying to calculate this for the circled mutation, and the tree on the right is just the tree on the left with only inferrable splits left in. Then k seriously depends on whether the polytomy time is above or below the time of your mutation node.
Thanks @a-ignatieva , this is super helpful (especially the diagram!)
I think worth throwing an error or something if there's more than one mutation at the site? Eg if a reversion happens somewhere in the subtree below (or a homoplasy somewhere else) then this test doesn't work conceptually. Likewise if the mutation is a singleton (the test conditions on frequency >=2).
Good call, I completely agree
I think it's totally unclear to me how to do this properly with polytomies. Feels like it depends on what a polytomy represents - suppose just uncertainty in split times (rather than actual multiple mergers or something):
So, I think we can just state that we assume the branching structure represents the true history, so in your example above k=9. If people are interested in a strictly binary model, then they need to use a different inference method, or somehow reconcile the non-binary inference with the trees themselves (e.g., by randomly generating binary trees using split_polytomies, or whatever). Note that in your example above just generating a random binary tree in the righthand subtree is of little use, because you also need branch lengths, and say, splitting equally would be quite unlikely under most population models.
I'm pretty comfortable with having polytomies in cases like this - they really do exist, sometimes (e.g., see the sc2 trees!)
I'd just worry that a user might not appreciate these subtleties when running this test on their trees - but maybe just there should be a clear description/warning in the documentation.
I think you have to be careful with the test in the actual multiple merger case. In the binary case we consider the tree just below the first split (at the dashed line in the pics above), and we have $f_N - 1$ ways of splitting $f_N$ samples into 2 groups, ${N-f_N-1 \choose k-3}$ ways of splitting the other $N-f_N$ samples into $k-2$ groups, divided by the total ${N-1 \choose k-1}$ ways of splitting $N$ samples into $k$ groups. This applies when you have binary trees and exchangeability so the distribution of leaf-set sizes is uniform, but is that the case for general multiple merger models? If there's a polytomy below the mutation this is already clearly incorrect (you'd need to replace 2 with the actual number of children), but I think the problem is bigger than that.
Also maybe better to replace the math.combs with log sums to avoid large N issues? Like
p = 0.0
for f in range(fn, N - k + 3):
if f == 2:
p += 1.0 / ((N-1) * (N-2))
else:
p1 = np.sum(np.log(range(N-k-f+3, N-k+1)))
p2 = np.sum(np.log(range(N-f, N)))
p += (f-1) * np.exp(p1 - p2)
p = p * (k-1) * (k-2)
Hmmm - so the fundamental question then I guess is whether this is well defined for non binary trees at all, and whether we should just raise an error if the tree is non binary.
Put another way - if we assume that we have something like tsinfer trees where the non binary nodes are from a mixture of uncertainty and genuine multiple mergers (in the large sample size case), would the result be a rough approximation of the truth, or just misleading?
It would be nice to have a function to calculate the selection test described in the Relate paper (p.16 of supplement) for mutations in a given tree. Or perhaps this is something that could be added to the documentation as an example, it's easy to code up using existing functions. This is the code I have (and an example).