tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 70 forks source link

Add Tree.generate_random() method #1033

Closed hyanwong closed 3 years ago

hyanwong commented 3 years ago

It's now trivial to add a tskit.Tree.generate_random() method for generating random bifurcating trees with equal probability for each topology. All we have to do is:

def generate_random(self, num_leaves, random_seed=None):
    return self.generate_star(num_leaves).split_polytomy(random_seed)

And we probably don't need another unit test for this, because we already test it here, so all we need to do (IMO) is change that function to generate_random() rather than split_polytomies(), and we should be covered.

The arguments for this (taken from https://github.com/tskit-dev/tskit/pull/1030#issuecomment-733721001) are

  1. It allows tskit to produce random trees without relying on an external dependency. That could be useful e.g. for SLiM testing, or what have you.
  2. It's the sort of thing that a phylogeneticist might expect in a phylogenetic library (if we are trying to appeal to that audience)
  3. The distribution of coalescence tree topologies is not equiprobable - see below - the top plot is msprime topologies, the bottom is from split_polytomy.
import matplotlib.pyplot as plt
import collections

import tqdm
import msprime
import tskit

msprime_counts = collections.defaultdict(int)
split_poly_counts = collections.defaultdict(int)
for s in tqdm.trange(1, 10000):
    msprime_counts[msprime.simulate(5, random_seed=s).first().rank()] += 1

for s in tqdm.trange(1, 10000):
    split_poly_counts[tskit.Tree.generate_star(5).split_polytomies(random_seed=s).rank()] += 1

fig, axes = plt.subplots(2)
axes[0].bar(list(range(len(msprime_counts))), msprime_counts.values())
axes[1].bar(list(range(len(split_poly_counts))), split_poly_counts.values())

image

I'm happy to submit this as a PR if the approach I describe is the right one.

jeromekelleher commented 3 years ago

One caveat here: I think we should either add an arity=None argument here, which will currently raise an error if arity != 2. At some point we might generate random trees uniformly from the full distribution, so we should keep the door open there.

Either that, or we call the function generate_random_binary.

One use for this I can think of is plugging into hypothesis to produce random trees for testing.

jeromekelleher commented 3 years ago

Interesting the coalescent doesn't generate all topologies uniformly - anyone have an explanation for that?

hyanwong commented 3 years ago

One caveat here: I think we should either add an arity=None argument here, which will currently raise an error if arity != 2. At some point we might generate random trees uniformly from the full distribution, so we should keep the door open there.

Yes, good point.

Either that, or we call the function generate_random_binary.

I prefer your suggestion above.

A question though: what's this for? One thing I can think of is plugging into hypothesis to produce random trees for testing.

Basically, I see it as a general function that any phylogenetic library should have. Msprime is great, but it's focussed on coalescence, which is not always what a phylogeneticist is after. There's a number of times I've wanted to do this in other libraries.

jeromekelleher commented 3 years ago

Yeah, I'm convinced, let's get it in!

hyanwong commented 3 years ago

Interesting the coalescent doesn't generate all topologies uniformly - anyone have an explanation for that?

I guess it's because the long branches "attract" coalescent events, so that the trees are biased towards being balanced? Although the tcPBWT paper (bottom of page 26) talks about coalescence and says "All hierarchical topologies are equiprobable." . Hmm. I'll investigate a little further.

hyanwong commented 3 years ago

Ah! "not all tree topologies are equi-probable under a coalescent model [16, 26, 27]": references in https://bmcevolbiol.biomedcentral.com/articles/10.1186/1471-2148-8-118

[16] Degnan JH, Rosenberg NA: Discordance of Species Trees with Their Most Likely Gene Trees. PloS Genet. 2006, 2: 762-769. 10.1371/journal.pgen.0020068. [26] Harding EF: The Probabilities fo Rooted Tree-Shapes Generated by Random Bifurcation. Adv Appl Prob. 1971, 3: 44-77. 10.2307/1426329. [27] Brown JKM: Probabilities of Evolutionary Trees. Syst Biol. 1994, 43: 78-91. 10.2307/2413582.

jeromekelleher commented 3 years ago

Very nice, good digging.

hyanwong commented 3 years ago

I wonder if Vladimir & Richard Durbin got it wrong then, or if I just misread what they meant in the tcPBWT paper?

jeromekelleher commented 3 years ago

I guess it's because the long branches "attract" coalescent events, so that the trees are biased towards being balanced?

You're right, coalescent trees are asymptotically balanced so therefore they cannot be a uniform draw from the distribution.

Li H, Wiehe T. Coalescent tree imbalance and a simple test for selective sweeps based on microsatellite variation. PLoS Comput Biol. 2013;9(5):e1003060. pmid:23696722

hyanwong commented 3 years ago

You learn a new thing every day! It doesn't surprise me that they aren't equiprobable, but I hadn't really thought about it until now.

petrelharp commented 3 years ago

Another way to think about it is that (IIRC) the coalescent produces the uniform distribution on trees-with-time-ordering: i.e., if you keep track of the relative order in which nodes on different branches happen as well. The non-uniformity comes from then forgetting about this ordering, and there being more ways to arrive at some topologies than others. For instance, ((a,b),(c,d)) has twice the probability of ((a,(b,(c,d))), but that's because there's two ways it could have happened: either (a,b) occurred first, or (c,d) did.

hyanwong commented 3 years ago

I'm coding this up at the moment @jeromekelleher, but one thing that is bothering me a little bit about split_polytomies is that the nodes are placed such that siblings are at the same time - is there a specific reason for this? It means we can't calculate the time of the youngest inserted node without knowing the topology, which causes some (minor) problems with the generate_random method - mainly because we have to specify the height of the tree before we split it.

In the previous iteration of split_polytomies I simply made each node epsilon apart from the next in the table, regardless of the topology. That meant that the youngest new node was always at a time num_leaves - 2 from the root. That would make nice logical node times in the random tree, since we could set the root time in the star tree to num_leaves -1, and the epsilon of split_polytomies() to 1. Then all nodes would be at integer times, with the youngest at time 1, and we would be guaranteed never to have a problem with too large an epsilon value to split the tree.

Thoughts?

hyanwong commented 3 years ago

Actually, I've just realised that if we set the root height to n-1 and use an epsilon of 1, then of course we are always fine to split the tree. It's just that a balanced tree will have very long terminal branches, whereas the shortest terminal branch of a caterpillar tree will be 1. But I guess that's OK. So feel free to ignore my comment above (sorry!)

jeromekelleher commented 3 years ago

It seems most logical to assign siblings equal times, since there's no particular reason for them to be in one order or another. Then all of the short branches are of the same time.

We should probably not use split polytomies to create the random tree though, the nodes should be allocated in postorder fashion like the other tree generators. If we don't, the we won't be able to say that the tree we generate is identical to Tree.unrank(n, rank), which is a nice property.

hyanwong commented 3 years ago

It seems most logical to assign siblings equal times, since there's no particular reason for them to be in one order or another. Then all of the short branches are of the same time.

Yep, fair enough I guess. It's a trade-off against various properties.

We should probably not use split polytomies to create the random tree though, the nodes should be allocated in postorder fashion like the other tree generators. If we don't, the we won't be able to say that the tree we generate is identical to Tree.unrank(n, rank), which is a nice property.

Ah, well that makes it more difficult to do, I suppose. I have a PR using the polytomy split method - should I just stick this in as a draft for the time being, and leave this open?

jeromekelleher commented 3 years ago

It's very simple, just putting together bits that are already lying around. The majority of the work is the tests and the docs.

If you have them done I can finish off your PR?

hyanwong commented 3 years ago

Yep perfect. I'll PR it. Most of the docs are done - there's only a single extra test, but your nice test_generate_XXX framework sees to most of the rest.

hyanwong commented 3 years ago

At #1037 . Note that I moved the split_polytomy:test_all_topologies test out of that class, into the new class, but if you aren't using split_polytomies any more, you might want to move it back.