petrelharp / ftprime_ms


Show nearest neighbor interchange example? #49

Closed jeromekelleher closed 6 years ago

jeromekelleher commented 6 years ago

We're currently defining tree sequences as representing the genetic history of a sample of individuals subject to recombination. However, we (I think) can also use the basic ideas to efficiently represent a set of trees generated by a stochastic search through tree space. The essential idea is the same: adjacent trees don't change much, so we just store the edges that change between them. It seems straightforward to do an example where we start with some large tree and then randomly change it using (e.g.) nearest-neighbour interchange, storing all the trees that we visit in a set of tables. We could then compare this with storing all the trees in Newick format. I think programs like BEAST do this sort of thing, so it's a potentially useful application.
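To make this concrete, here's a rough sketch (mine, written against what is now the tskit tables API, so take the exact calls as illustrative rather than definitive): two four-tip trees that differ by a single NNI, encoded as one tree sequence with one unit of "chromosome" per tree. The edges shared by both trees are stored once and span the whole sequence; only the edges that differ are restricted to a single unit.

```python
import tskit

tables = tskit.TableCollection(sequence_length=2)
# Four sample leaves (node ids 0-3) at time 0.
for _ in range(4):
    tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
# Three internal nodes shared by both trees.
n4 = tables.nodes.add_row(time=1)
n5 = tables.nodes.add_row(time=2)
n6 = tables.nodes.add_row(time=3)  # root

# Tree on [0, 1): ((0, 1), (2, 3));  tree on [1, 2): ((0, 2), (1, 3)).
# Edges present in both trees span the whole sequence and are stored once.
tables.edges.add_row(left=0, right=2, parent=n6, child=n4)
tables.edges.add_row(left=0, right=2, parent=n6, child=n5)
tables.edges.add_row(left=0, right=2, parent=n4, child=0)
tables.edges.add_row(left=0, right=2, parent=n5, child=3)
# Only the edges that change between the two trees are limited to one unit.
tables.edges.add_row(left=0, right=1, parent=n4, child=1)
tables.edges.add_row(left=0, right=1, parent=n5, child=2)
tables.edges.add_row(left=1, right=2, parent=n4, child=2)
tables.edges.add_row(left=1, right=2, parent=n5, child=1)

tables.sort()
ts = tables.tree_sequence()
for tree in ts.trees():
    print(tree.interval, tree.newick())
```

Over a long walk through tree space, each move adds only a handful of edge rows like the last four above, instead of a whole new Newick string per tree.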

This is probably a tangent though, and might not be worth the bother (or perhaps sufficiently obvious that we can just state it in the discussion). Any thoughts?

petrelharp commented 6 years ago

wow, that's awesome. It sounds worth mentioning. Here's some BEAST output: https://raw.githubusercontent.com/taming-the-beast/Introduction-to-BEAST2/master/precooked_runs/primate-mtDNA.trees

Are you suggesting (a) take some BEAST output (b) turn it into tables in the obvious way, one tree at a time, with one unit of "chromosome" equal to one tree, (c) simplify? Or, we could just write a simple SPR random walk through treespace that records a tree sequence.

I don't want to postpone finishing, but this sounds possibly worth it.

jeromekelleher commented 6 years ago

> Are you suggesting (a) take some BEAST output (b) turn it into tables in the obvious way, one tree at a time, with one unit of "chromosome" equal to one tree, (c) simplify? Or, we could just write a simple SPR random walk through treespace that records a tree sequence.

Hmmm, OK. I'm suggesting that we start with some tree and then do a random walk through tree space from there, which we fully record. For a decently sized n and a reasonably long walk, we should make dramatic gains over the standard O(n)-per-tree Newick approach. Parsing the BEAST output and trying to figure out which nodes have changed would be pretty messy, and it certainly wouldn't be a quick and easy thing to do.

Thinking this through a bit more:

  1. It's probably not that interesting to do an SPR random walk, since that's effectively (sort-of, anyway) what we're doing with a coalescent simulation. Some other way of moving around tree space would be better, as this would demonstrate that we're not limited to efficiently encoding SPRs.

  2. What's our interpretation of nodes then? Do we create one or more new nodes for every new tree, or do we try to 'return' to previous nodes in the tree sequence?

  3. Is branch length important, or can we just think about topology?

I know almost nothing about this!

I don't want to get too distracted --- if it's not a 30 minute coding exercise for me, then I think we should just mention this as an interesting direction for future work and move on.

petrelharp commented 6 years ago

> It's probably not that interesting to do an SPR random walk, since that's effectively (sort-of, anyway) what we're doing with a coalescent simulation. Some other way of moving around tree space would be better, as this would demonstrate that we're not limited to efficiently encoding SPRs.

Hm, I'm not sure it has to be more "interesting": just making the point that the encoding is not limited to genealogical tree sequences is good. And I think that any move people use in practice is expressible as a small number of SPRs. I can check with Erick Matsen about this.

> What's our interpretation of nodes then? Do we create one or more new nodes for every new tree, or do we try to 'return' to previous nodes in the tree sequence?

node = species label, no? So, same set of nodes for all trees?

> Is branch length important, or can we just think about topology?

In phylogenetics, branch length is important but confusing, I'd say. Uh-oh, this is a big problem, though: in different trees, nodes will be at different heights (whether we track topology only or branch lengths too). And if each tree had a different set of nodes, I don't think the encoding would be efficient?

jeromekelleher commented 6 years ago

> In phylogenetics, branch length is important but confusing, I'd say. Uh-oh, this is a big problem, though: in different trees, nodes will be at different heights (whether we track topology only or branch lengths too). And if each tree had a different set of nodes, I don't think the encoding would be efficient?

Yeah, this is the issue I was getting at above with the node question. Surely most of the nodes between adjacent trees are the same though, right? This must be what it means to make short jumps in tree space. Looking at the discussion of NNI here it looks to me like we can just reuse the nodes most of the time.

How does this sound:

  1. Generate a random binary tree using msprime, and change the time values for each node to its level.
  2. Choose an internal node with at least 4 leaves, uniformly at random (we need to have 4 subtrees, I think).
  3. Let u be the focal node, and v and w be its children. End the edges from v to its children, and w to its children.
  4. Swap one child of v with one child of w, and create the new edges back to v and w.
  5. Create a new node if there is any time violation (i.e., one of our newly inserted children is older than its parent).

It's all pretty simple except for deciding the time of nodes, I think.
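A rough sketch of steps 2-4 on a plain dict-based tree (my own illustration, not anything we've written; `children`, `parent`, and `internal_nodes` are hypothetical inputs, and the step-5 time bookkeeping is only noted in a comment). Only the two swapped edges actually need to change:

```python
import random

def nni_like_swap(children, parent, internal_nodes):
    # children[u] lists u's two children; parent maps each child to its parent.
    # Step 2: pick a focal internal node u both of whose children are also
    # internal, so there are four subtrees to rearrange.
    candidates = [u for u in internal_nodes
                  if all(c in children for c in children[u])]
    u = random.choice(candidates)
    v, w = children[u]
    # Steps 3-4: end one edge below v and one below w, then swap the detached
    # children between v and w and create the new edges.
    cv = random.choice(children[v])
    cw = random.choice(children[w])
    children[v] = [c for c in children[v] if c != cv] + [cw]
    children[w] = [c for c in children[w] if c != cw] + [cv]
    parent[cw], parent[cv] = v, w
    # Step 5 (not shown): if cw is older than v, or cv older than w, we'd need
    # a new node (or new times) to keep every parent older than its children.
    return u
```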

petrelharp commented 6 years ago

> Yeah, this is the issue I was getting at above with the node question.

I'm definitely being slow here. Duh, no species labels for internal nodes. =) But yes, you're right; internal nodes could retain their meaning for a while.

In your scheme you've erased two edges and the four coming off their ends; in NNI I think you erase one edge and the four coming off? Wouldn't it be:

  1. Choose an internal node v uniformly, and let u be its parent (and suppose for now that u isn't the root).

  2. Call a and b the children of v; call c the child of u that isn't v, and d the parent of u.

  3. Do one of these things:

    • erase v and reattach b somewhere at random on the branch leading from u to c
    • erase v and reattach a somewhere at random on the branch leading from u to c
    • erase u and reattach c somewhere at random on the branch leading from d to v

The last one wouldn't be necessary if we were just dealing with topologies. Note that these are all SPR moves.
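For concreteness, a rough sketch of the first move above on a plain dict-based tree (again my own illustration; `children`, `parent`, `time`, and `new_node` are hypothetical inputs):

```python
import random

def spr_move(children, parent, time, v, new_node):
    # v is an internal, non-root node; u is its parent; a, b are its children;
    # c is the other child of u.
    u = parent[v]
    a, b = children[v]
    (c,) = [ch for ch in children[u] if ch != v]
    # "Erase v": a now hangs directly from u.
    children[u] = [a if ch == v else ch for ch in children[u]]
    parent[a] = u
    del children[v], parent[v]
    # Reattach b at a random point on the u--c branch, via new_node, whose
    # time must exceed both of its new children's times but stay below u's.
    time[new_node] = random.uniform(max(time[b], time[c]), time[u])
    children[new_node] = [b, c]
    parent[b] = parent[c] = new_node
    children[u] = [new_node if ch == c else ch for ch in children[u]]
    parent[new_node] = u
    return new_node
```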

ashander commented 6 years ago

Very cool connection to make! It definitely seems true that msprime's efficient storage of, and iteration over, trees has broader applications. Clearly, the ability we describe here to efficiently move data between third-party code and tree sequences, with clear APIs, enables all sorts of potential applications. And I think we should mention that in the discussion, with at least this as an example.

But --- and I've only skimmed and may be missing something --- to me an analysis or code demo of this does seem like quite a tangent to the main point of this paper: describing "efficiently recording the entire genetic history of a population during a forwards-time, individual-based population genetics simulation" and the tech/APIs that enable the efficient recording.

some questions:

  • is the gain here over standard approaches merely in storage or in speed of iteration/movement between trees?
  • does BEAST and similar really use internally an O(n) Newick representation? If so maybe they don't typically generate very many trees simultaneously so it doesn't matter much to them?

jeromekelleher commented 6 years ago

> But --- and I've only skimmed and may be missing something --- to me an analysis or code demo of this does seem like quite a tangent to the main point of this paper: describing "efficiently recording the entire genetic history of a population during a forwards-time, individual-based population genetics simulation" and the tech/APIs that enable the efficient recording.

Yes, you're right @ashander, this is a tangent. There's probably a paper in looking at this stuff properly, and we should finish what we have ASAP.

> is the gain here over standard approaches merely in storage or in speed of iteration/movement between trees? does BEAST and similar really use internally an O(n) Newick representation? If so maybe they don't typically generate very many trees simultaneously so it doesn't matter much to them?

Something like BEAST outputs a sequence of trees representing the MCMC chain. BEAST only outputs a subset of the trees, to reduce storage costs. So the two obvious gains are that (1) we could store ALL the trees in substantially less space and (2) we could post-process these trees a lot more quickly afterwards. These aren't game-changing differences, though. There are probably more interesting things that could be done in terms of computing the average tree from the trace, but that's real algorithmic work...
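As a rough illustration of point (1) (my own sketch, using a simulated tree sequence as a stand-in for a trace of correlated trees), one can compare the total bytes of one Newick string per tree against the size of the equivalent tree sequence file:

```python
import os
import msprime

# Stand-in for a trace of correlated trees: a simulated tree sequence.
ts = msprime.simulate(sample_size=1000, recombination_rate=10, random_seed=1)
newick_bytes = sum(len(tree.newick()) for tree in ts.trees())
ts.dump("trace.trees")
print("number of trees:", ts.num_trees)
print("total Newick bytes:", newick_bytes)
print("tree sequence file bytes:", os.path.getsize("trace.trees"))
```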

Your scheme for NNI sounds right @petrelharp --- I just don't know very much about it. Given that we know we can support SPRs, I think we have enough material here for an interesting paragraph in the discussion.

petrelharp commented 6 years ago

> an interesting paragraph in the discussion

That sounds right.