tskit-dev / tsconvert

Utilities for converting tree sequences to and from other formats
MIT License
7 stars 6 forks source link

First pass at BEAST import. #11

Open jeromekelleher opened 4 years ago

jeromekelleher commented 4 years ago

An initial pass at doing BEAST tree imports. Unfortunately, it looks like node times are not maintained between adjacent trees and so we cannot turn it the input into a meaningful tree sequence (well, not a succinct one, anyway). We will always get a JBOT out of BEAST output as far as I can tell.

For example, I tried this code on the PRS.trees file I found on figshare. This has 429 leaves, and is quite a bit tree file. Typically, there's 10,000 trees in a BEAST analysis, which in this case gives a 407MB nexus file.

Running the conversion code on a 46MB subset of this, we get a 41MB .trees file (not including the "rates" annotations for nodes, or any other metadata; we should include the sample names, but the rates are probably irrelevant). So there's no real compression here. However, the load time is a massive improvement - parsing the trees file with dendropy is extremely slow (I tried to run on the full 407MB file, but my computer ran out of memory (16G) after about 10 minutes).

I'm not sure why we can't identify nodes by their times. In the fairly large case of 429 leaves it seems implausible that an MCMC would change every branch in every tree after 5000 samples. We're definitely not failing to identify nodes because of the inherent numerical precision crappiness of newick and representation of time by branch lengths, as I've experimented a bit with precision and even nodes that are close to the leaves will change time. In this example, I looked at a node that was the parent of two leaves in many adjacent trees, and its time varies slightly from tree to tree:

0.03891914
0.03791369
0.03836194
0.03904479
0.0376319
0.03849151
0.03825644
0.03834921
0.03793981
0.03898491
0.03776027
0.03747061
0.03821779
0.03823619
0.03809714
0.03738423
0.03947856
0.03735052
0.03842629
0.03865787
0.03809055
0.04046528
0.04071309
0.0405558
0.04293515
0.04248632
0.04139696
0.04035231
0.04113374
0.04305416
0.04191715

This clearly isn't a precision issue --- I think BEAST must be recomputing the branch lengths for each tree.

If we put in a really low precision then we can't identify nodes by their times, as we get lots of nodes with the same time in the same tree.

JereKoskela commented 4 years ago

There is a move in BEAST called "Internal node heights", which I think resamples the heights of all internal nodes and hence changes every branch length in one go if accepted. I think it also has quite a high default probability of taking place (BEAST randomises exactly which update takes place at each time step). See here: https://beast.community/first_tutorial#mcmc-operators

Addendum: The corollary is that if we wanted to store all output (as opposed to just very thinned output), then we wouldn't see a JBOT except when a global update like this gets accepted. So what you're doing could well be worthwhile if it enables a feasible reduction in thinning.

jeromekelleher commented 4 years ago

Aha, thanks @JereKoskela! That makes perfect sense now.