Closed hyanwong closed 3 years ago
One caveat here: I think we should either add an arity=None
argument here, which will currently raise an error if arity != 2. At some point we might generate random trees uniformly from the full distribution, so we should keep the door open there.
Either that, or we call the function generate_random_binary
.
One use for this I can think of is plugging into hypothesis to produce random trees for testing.
Interesting the coalescent doesn't generate all topologies uniformly - anyone have an explanation for that?
One caveat here: I think we should either add an
arity=None
argument here, which will currently raise an error if arity != 2. At some point we might generate random trees uniformly from the full distribution, so we should keep the door open there.
Yes, good point.
Either that, or we call the function
generate_random_binary
.
I prefer your suggestion above.
A question though: what's this for? One thing I can think of is plugging into hypothesis to produce random trees for testing.
Basically, I see it as a general function that any phylogenetic library should have. Msprime is great, but it's focussed on coalescence, which is not always what a phylogeneticist is after. There's a number of times I've wanted to do this in other libraries.
Yeah, I'm convinced, let's get it in!
Interesting the coalescent doesn't generate all topologies uniformly - anyone have an explanation for that?
I guess it's because the long branches "attract" coalescent events, so that the trees are biased towards being balanced? Although the tcPBWT paper (bottom of page 26) talks about coalescence and says "All hierarchical topologies are equiprobable." . Hmm. I'll investigate a little further.
Ah! "not all tree topologies are equi-probable under a coalescent model [16, 26, 27]": references in https://bmcevolbiol.biomedcentral.com/articles/10.1186/1471-2148-8-118
[16] Degnan JH, Rosenberg NA: Discordance of Species Trees with Their Most Likely Gene Trees. PloS Genet. 2006, 2: 762-769. 10.1371/journal.pgen.0020068. [26] Harding EF: The Probabilities fo Rooted Tree-Shapes Generated by Random Bifurcation. Adv Appl Prob. 1971, 3: 44-77. 10.2307/1426329. [27] Brown JKM: Probabilities of Evolutionary Trees. Syst Biol. 1994, 43: 78-91. 10.2307/2413582.
Very nice, good digging.
I wonder if Vladimir & Richard Durbin got it wrong then, or if I just misread what they meant in the tcPBWT paper?
I guess it's because the long branches "attract" coalescent events, so that the trees are biased towards being balanced?
You're right, coalescent trees are asymptotically balanced so therefore they cannot be a uniform draw from the distribution.
Li H, Wiehe T. Coalescent tree imbalance and a simple test for selective sweeps based on microsatellite variation. PLoS Comput Biol. 2013;9(5):e1003060. pmid:23696722
You learn a new thing every day! It doesn't surprise me that they aren't equiprobable, but I hadn't really thought about it until now.
Another way to think about it is that (IIRC) the coalescent produces the uniform distribution on trees-with-time-ordering: i.e., if you keep track of the relative order in which nodes on different branches happen as well. The non-uniformity comes from then forgetting about this ordering, and there being more ways to arrive at some topologies than others. For instance, ((a,b),(c,d)) has twice the probability of ((a,(b,(c,d))), but that's because there's two ways it could have happened: either (a,b) occurred first, or (c,d) did.
I'm coding this up at the moment @jeromekelleher, but one thing that is bothering me a little bit about split_polytomies
is that the nodes are placed such that siblings are at the same time - is there a specific reason for this? It means we can't calculate the time of the youngest inserted node without knowing the topology, which causes some (minor) problems with the generate_random
method - mainly because we have to specify the height of the tree before we split it.
In the previous iteration of split_polytomies I simply made each node epsilon
apart from the next in the table, regardless of the topology. That meant that the youngest new node was always at a time num_leaves - 2
from the root. That would make nice logical node times in the random tree, since we could set the root time in the star tree to num_leaves -1
, and the epsilon of split_polytomies()
to 1. Then all nodes would be at integer times, with the youngest at time 1, and we would be guaranteed never to have a problem with too large an epsilon value to split the tree.
Thoughts?
Actually, I've just realised that if we set the root height to n-1
and use an epsilon of 1, then of course we are always fine to split the tree. It's just that a balanced tree will have very long terminal branches, whereas the shortest terminal branch of a caterpillar tree will be 1. But I guess that's OK. So feel free to ignore my comment above (sorry!)
It seems most logical to assign siblings equal times, since there's no particular reason for them to be in one order or another. Then all of the short branches are of the same time.
We should probably not use split polytomies to create the random tree though, the nodes should be allocated in postorder fashion like the other tree generators. If we don't, the we won't be able to say that the tree we generate is identical to Tree.unrank(n, rank), which is a nice property.
It seems most logical to assign siblings equal times, since there's no particular reason for them to be in one order or another. Then all of the short branches are of the same time.
Yep, fair enough I guess. It's a trade-off against various properties.
We should probably not use split polytomies to create the random tree though, the nodes should be allocated in postorder fashion like the other tree generators. If we don't, the we won't be able to say that the tree we generate is identical to Tree.unrank(n, rank), which is a nice property.
Ah, well that makes it more difficult to do, I suppose. I have a PR using the polytomy split method - should I just stick this in as a draft for the time being, and leave this open?
It's very simple, just putting together bits that are already lying around. The majority of the work is the tests and the docs.
If you have them done I can finish off your PR?
Yep perfect. I'll PR it. Most of the docs are done - there's only a single extra test, but your nice test_generate_XXX
framework sees to most of the rest.
At #1037 . Note that I moved the split_polytomy:test_all_topologies
test out of that class, into the new class, but if you aren't using split_polytomies
any more, you might want to move it back.
It's now trivial to add a
tskit.Tree.generate_random()
method for generating random bifurcating trees with equal probability for each topology. All we have to do is:And we probably don't need another unit test for this, because we already test it here, so all we need to do (IMO) is change that function to
generate_random()
rather thansplit_polytomies()
, and we should be covered.The arguments for this (taken from https://github.com/tskit-dev/tskit/pull/1030#issuecomment-733721001) are
I'm happy to submit this as a PR if the approach I describe is the right one.