tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 69 forks source link

Topology counter not aligning with coalescent expectation #2872

Open TymekPieszko opened 7 months ago

TymekPieszko commented 7 months ago

I came across a discrepancy between the tskit classification of tree topologies and the Kingman coalescent. If genealogies are rooted, bifurcating trees with nodes ordered in time, then there are 18 such trees with 4 tips:

import math
def coalescent_topologies(k):
    if k == 2:
        return 1
    else:
        return math.comb(k, 2) * coalescent_topologies(k-1)
coalescent_topologies(4)
# 18 trees

And all these trees are equally likely. In contrast, there are only 15 distinct tree ranks in an msprime simulation. For example:

import tskit, msprime
from collections import Counter
reps = 10000
ranks = []
for rep in (range(reps)):
    ts = msprime.sim_ancestry(2, population_size=1000)
    ranks.append(ts.first().rank())
len(Counter(ranks))
# 15 ranks

This stems from the 6 symmetric genealogies getting lumped into 3 ranks, each with double the frequency of each of the remaining 12 ranks. Plotting % of each rank from the example above:

image

So, for instance, there are 2 distinct ((0,1),(2,3)) coalescent trees depending on which pair of tips coalesces first. Ranks do not distinguish between the order of coalescence events, so both get the same rank (4,0). I can see many potential issues with this, e.g., when wanting to compare against a null distribution of genealogies. I think coalescent trees themselves and not their topologies should be counted by default?

jeromekelleher commented 7 months ago

Tricky one. At a minimum we need to carefully explain and document this. Ideally I guess we'd support counting ordered coalescent trees also.

From an empirical perspective, do you think the current counting is doing what would be expected @TymekPieszko?

TymekPieszko commented 7 months ago

I think it depends @jeromekelleher. I can imagine it could be problematic if wanting to compare some analytical expectations of the frequency of genealogies against msprime output, especially if the user expects topology counter to align with the theory. While the example with 4 samples is manageable (i.e., I know what gets lumped with what and why), the discrepancy quickly gets out of hand (e.g., there are 2700 coalescent genealogies with 6 samples but only 945 corresponding ranks).

Perhaps it would be good to hear @hyanwong 's perspective.

hyanwong commented 7 months ago

The topology API is not specifically linked to coalescent trees (for example we also count trees with multifurcations / polytomies). It is fairly clear to me that a tree topology should not consider branch lengths, and therefore should not care about the time ordering of internal nodes.

It would certainly be worth noting that the coalescent does not result in an even distribution of possible topologies (e.g. as counted by https://oeis.org/A000311). This is, presumably, a well-known fact to phylogenetic + coalescent theorists? Other methods of generating trees (e.g. birth-death processes) will produce other non-even distributions of topologies.

Note that the trees created by e.g. tskit.Tree.unrank(4, (4, 0)) have both internal nodes at the same time, which doesn't correspond to either of the (Hudson) coalescent alternatives that you mention.

hyanwong commented 7 months ago

Note also that this is relevant for how to split polytomies. In fact, the R and python implementations of polytomy splitting originally did so in a coalescent-y type way, until I pointed out that this was not equiprobable:

https://www.mail-archive.com/r-sig-phylo@r-project.org/msg05595.html

E.g. Mark Holder said:

[the previous Dendropy method gave]an equiprobable distribution over labelled histories (where the depth of the nodes you are pinching off matters) rather than topologies. That pinching off pairs is essentially what we do when simulating coalescent histories

See also page 6 in https://cran.r-project.org/web/packages/ape/vignettes/RandomTopologies.pdf

hyanwong commented 7 months ago

@TymekPieszko: if the topology counter counts "coalescent tree shapes", what do we do in a WF simulation when two coalescent events happen simultaneously?

My suspicion is that it would be better to have a way of multiplying the counts by some value depending on the balance of the tree (or some other metric that accounts for node depth order). Do you want to investigate this @TymekPieszko and figure out what the function should be to return such a multiplier, given a tree (or the shape of a rank)?

If it is quick to calculate (I suspect it won't be!), then we can define a coalescent_combinations function and do something like

expectation = {}
for tree in tskit.all_trees():
   expectation[tree.rank()] = tskit.combinatorics.coalescent_combinations(tree)

tot = sum(expectation.values())
expectation = {k: e/sum for k, e in expectation.items()}
jeromekelleher commented 7 months ago

I doubt there is a straightforward multiplier here because the symmetry breaking is recursive, and you'd need a separate rank/unrank method.

I think the action we need to close this issue is some documentation showing this example, and just warning the user that we're not counting possible coalescent histories but just leaf-labelled topologies.

hyanwong commented 7 months ago

Yes, I agree that this may be complicated and there may not be a nice algorithm for weights. @jeromekelleher makes a good suggestion to close the issue.

In case it helps a literature search etc, here is the sequence of "Number of coalescence sequences or labeled histories for n lineages: the number of sequences by which n distinguishable leaves can coalesce to a single sequence (Rosenberg 2019)": https://oeis.org/A006472 - these are binary trees only.

And here is the sequence of number of labelled topologies: https://oeis.org/A000311

Note that those pages have useful search terms with might allow a further literature search.

Also, note that the topology counter includes ranks with polytomies, so it's never going to map neatly onto coalescent trees (and is not intended to do so).

hyanwong commented 7 months ago

I suggest the best thing would be for Tymek to put a note somewhere in https://tskit.dev/tskit/docs/stable/topological-analysis.html. If you want to work up an example, @TymekPieszko, you could put it in https://tskit.dev/tutorials/counting_topologies.html which is intended to be a bit more long-form.

TymekPieszko commented 6 months ago

I agree adding documentation is the minimal requirement to close this issue @hyanwong & @jeromekelleher - working on a PR which will summarise what has been said here. Thank you both!

jeromekelleher commented 6 months ago

Feel free to open a "draft" PR with an early version for feedback @TymekPieszko