tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Add selection test p-values #2190

Open a-ignatieva opened 2 years ago

a-ignatieva commented 2 years ago

It would be nice to have a function to calculate the selection test described in the Relate paper (p.16 of supplement) for mutations in a given tree. Or perhaps this is something that could be added to the documentation as an example, it's easy to code up using existing functions. This is the code I have (and an example).

import msprime
import math
import numpy as np

# Get number of lineages at time just above node
def get_num_lineages(tree, node):
    if tree.is_sample(node):
        count = tree.num_samples()
    else:
        nodes_ordered = tree.timedesc()
        count = 1 + np.where(nodes_ordered == node)[0]
    return count

# Calculate Relate selection test p-value
def selection_pvalue(tree, mutation):
    node = mutation.node
    if tree.is_sample(node):
        # Can only compute this for mutations with frequency >= 2
        p = None
    else:
        N = tree.num_samples()
        fn = tree.num_samples(node)
        # k is the number of lineages just after the edge carrying the mutation splits into 2
        k = 1 + get_num_lineages(tree, node)[0]
        p = 0.0
        for f in range(fn, N - k + 3):
            p += (f - 1) * math.comb(N - f - 1, k - 3) / math.comb(N - 1, k - 1)
    return p
ts = msprime.sim_ancestry(
    10,
    population_size = 1,
    sequence_length = 1,
    ploidy = 1,
    recombination_rate = 0,
    discrete_genome = False,
    random_seed = 1)

ts = msprime.sim_mutations(ts, 
                           rate=3, 
                           discrete_genome=False, 
                           random_seed=1, 
                           model=msprime.BinaryMutationModel())

pic1

for m in ts.mutations():
    print(m.id, m.site, ts.site(m.site).position, selection_pvalue(ts.first(), m))
0 0 0.027387595968320966 None
1 1 0.09280081023462117 0.41666666666666663
2 2 0.2044522489886731 0.5952380952380951
3 3 0.23608897626399994 None
4 4 0.4306985700968653 1.0
5 5 0.5181525482330471 0.41666666666666663
6 6 0.7783892354927957 1.0
7 7 0.8463109182193875 0.7222222222222222
8 8 0.8650202453136444 0.41666666666666663
9 9 0.9325573612004519 None
10 10 0.939127794932574 1.0
benjeffery commented 2 years ago

Hi @a-ignatieva, Thanks for the suggestion and example code. I believe the first function in your code (get_num_lineages) is being discussed at https://github.com/tskit-dev/tskit/issues/386, with reference to using it for the Relate statistic, so this is certainly something that there is motivation to do.

Maybe I've missed something, but it looks like your function counts the number of nodes later than a node rather than the lineages?

a-ignatieva commented 2 years ago

Ah great, thanks @benjeffery - I figured this probably wasn't a new thought!

I assumed there are no unary nodes and a binary tree, so just seemed like a shortcut to count nodes that are older than the given node (as each one creates one additional lineage). That works right?

I thought it'd best to go by nodes rather than times (the Relate test at least needs the number of lineages just above a node) - and something like this should work with polytomies I think:

def get_num_lineages_in_tree(tree):  
    results = {}
    count = 1
    for node in tree.nodes(order="timedesc"):
        if tree.is_sample(node):
            results[node] = tree.num_samples()
        else:
            results[node] = count
            count += tree.num_children(node) - 1
    return results
jeromekelleher commented 2 years ago

This is great, thanks @a-ignatieva. To get an idea of how to structure this so we can compute it as efficiently as we can, would we usually want to do this for all mutations? So, we'd have a TreeSequence method like

def selection_test_pvalue(self):
     """"
     Return an array of P-values under the Relate selection test for each mutation in the tree sequence.
     """

or would we usually hand-pick specific mutations?

benjeffery commented 2 years ago

I assumed there are no unary nodes and a binary tree, so just seemed like a shortcut to count nodes that are older than the given node (as each one creates one additional lineage). That works right?

Clever! Almost all tree sequences from (for example) tsinfer will not have those properties, which is making wonder if there is value in adding Tree.has_unary_nodes() and Tree.is_binary() to the API.

a-ignatieva commented 2 years ago

To get an idea of how to structure this so we can compute it as efficiently as we can, would we usually want to do this for all mutations? or would we usually hand-pick specific mutations?

@jeromekelleher I think definitely both, ideally! I'm looking to do the former, but I can also imagine wanting to do the latter for specific mutations.

Almost all tree sequences from (for example) tsinfer will not have those properties, which is making wonder if there is value in adding Tree.has_unary_nodes() and Tree.is_binary() to the API.

@benjeffery I think this would be really great to have!

hyanwong commented 2 years ago

I assumed there are no unary nodes and a binary tree, so just seemed like a shortcut to count nodes that are older than the given node (as each one creates one additional lineage). That works right?

Clever! Almost all tree sequences from (for example) tsinfer will not have those properties, ...

I don't think we should assume binary trees here for the statistic. I guess it would not add much to the efficiency to check on number of children of a node in the tree, though.

jeromekelleher commented 2 years ago

I don't think we should assume binary trees here for the statistic. I guess it would not add much to the efficiency to check on number of children of a node in the tree, though.

Agreed - this would make it basically useless for tsinfer trees.

jeromekelleher commented 1 year ago

Picking this one up again, here's a proposal for what a single mutation implementation would look like (not that we have the Tree.num_lineages method:

def relate_test_pvalue(self, mutation_id):
     mut = self.mutation(mut)
     site = self.site(mut.site)
     tree = self.at(site.position)
     time = self.node(mut.mutations).time 
     N = self.num_samples
     k = 1 + tree.num_lineages(time)
     fn = tree.num_samples(mut.node)
     p = 0.0
     for f in range(fn, N - k + 3):
         p += (f - 1) * math.comb(N - f - 1, k - 3) / math.comb(N - 1, k - 1)
     return p

Is this correct when we have polytomies?

It would be good to get this in to the library, and I'm happy to push through the coding if we can agree on an concrete definition to implement.

a-ignatieva commented 1 year ago

I think worth throwing an error or something if there's more than one mutation at the site? Eg if a reversion happens somewhere in the subtree below (or a homoplasy somewhere else) then this test doesn't work conceptually. Likewise if the mutation is a singleton (the test conditions on frequency >=2).

I think it's totally unclear to me how to do this properly with polytomies. Feels like it depends on what a polytomy represents - suppose just uncertainty in split times (rather than actual multiple mergers or something): then in a binary tree the number of lineages when a mutations gets to frequency 2 is well defined - but in a tree with polytomies taking k as [number of lineages just above the node + 1] will generally be an under- or over-estimate of the actual number of lineages that were there at this split point, right? Eg in this crude drawing, you're trying to calculate this for the circled mutation, and the tree on the right is just the tree on the left with only inferrable splits left in. Then k seriously depends on whether the polytomy time is above or below the time of your mutation node. IMG_20230208_150032080~2

jeromekelleher commented 1 year ago

Thanks @a-ignatieva , this is super helpful (especially the diagram!)

I think worth throwing an error or something if there's more than one mutation at the site? Eg if a reversion happens somewhere in the subtree below (or a homoplasy somewhere else) then this test doesn't work conceptually. Likewise if the mutation is a singleton (the test conditions on frequency >=2).

Good call, I completely agree

jeromekelleher commented 1 year ago

I think it's totally unclear to me how to do this properly with polytomies. Feels like it depends on what a polytomy represents - suppose just uncertainty in split times (rather than actual multiple mergers or something):

So, I think we can just state that we assume the branching structure represents the true history, so in your example above k=9. If people are interested in a strictly binary model, then they need to use a different inference method, or somehow reconcile the non-binary inference with the trees themselves (e.g., by randomly generating binary trees using split_polytomies, or whatever). Note that in your example above just generating a random binary tree in the righthand subtree is of little use, because you also need branch lengths, and say, splitting equally would be quite unlikely under most population models.

I'm pretty comfortable with having polytomies in cases like this - they really do exist, sometimes (e.g., see the sc2 trees!)

a-ignatieva commented 1 year ago

I'd just worry that a user might not appreciate these subtleties when running this test on their trees - but maybe just there should be a clear description/warning in the documentation.

I think you have to be careful with the test in the actual multiple merger case. In the binary case we consider the tree just below the first split (at the dashed line in the pics above), and we have $f_N - 1$ ways of splitting $f_N$ samples into 2 groups, ${N-f_N-1 \choose k-3}$ ways of splitting the other $N-f_N$ samples into $k-2$ groups, divided by the total ${N-1 \choose k-1}$ ways of splitting $N$ samples into $k$ groups. This applies when you have binary trees and exchangeability so the distribution of leaf-set sizes is uniform, but is that the case for general multiple merger models? If there's a polytomy below the mutation this is already clearly incorrect (you'd need to replace 2 with the actual number of children), but I think the problem is bigger than that.

Also maybe better to replace the math.combs with log sums to avoid large N issues? Like

p = 0.0
for f in range(fn, N - k + 3):
     if f == 2:
          p += 1.0 / ((N-1) * (N-2))
     else:
          p1 = np.sum(np.log(range(N-k-f+3, N-k+1)))
          p2 = np.sum(np.log(range(N-f, N)))
          p += (f-1) * np.exp(p1 - p2)
p = p * (k-1) * (k-2)
jeromekelleher commented 1 year ago

Hmmm - so the fundamental question then I guess is whether this is well defined for non binary trees at all, and whether we should just raise an error if the tree is non binary.

Put another way - if we assume that we have something like tsinfer trees where the non binary nodes are from a mixture of uncertainty and genuine multiple mergers (in the large sample size case), would the result be a rough approximation of the truth, or just misleading?