matsengrp / larch

Inference and manipulation of history DAGs
2 stars 2 forks source link

Calculate leaf sequences from Fitch sets #44

Closed marybarker closed 11 months ago

marybarker commented 1 year ago

This is a beginning step toward implementing #6

Since the callback has access to the altered/updated nodes for a given move, we can fully resolve these nodes without using the entire tree if we use the Fitch set changes that the move provides access to through the node_with_major_allele_set_change vector.

The Fitch algorithm is an alternative to the Sankoff algorithm, where Fitch sets are assigned to nodes instead of cost vectors. Each site in the sequence for node n is assigned a Fitch set: a set containing the possible choices of base that minimize the subtree cost below that node.

Fitch Algorithm

(for simplicity, written for a length-1 sequence, but generalizes to length-k sequence if we allow a Fitch set for each site in the sequence)

Notation: given a node n

Then we assign Fitch sets to the nodes in a postorder traversal as follows:

Using the Fitch sets and boundary allele sets, the sequences are assigned in a preorder traversal:

willdumm commented 1 year ago

EDIT 1/13/23: Mary and I talked through lots of details, and I've attempted to update everything to reflect the details that we worked out. We think we've figured everything out, so hopefully this is a good starting point for an implementation.


EDIT 1/11/23: I've made various changes based on our conversation with Cheng, but I may have missed things, and anyone can feel free to make more corrections if you see them.


So we need a way to take an SPR move and build a tree fragment (like those in #43) summarizing all the changes (to CG's) that result from applying that SPR move to the original tree (sampled from the DAG).

Here's a description of how to do this in different words than Mary's that's still kind of vague:

Part One:

Although we don't want to actually apply the SPR move to a copy of the original tree, we could use a class storing references to the original tree and the SPR move data, to encapsulate the logic involved in getting node neighbors and other node data in the hypothetical tree resulting from applying that SPR move. I'll call this the hypothetical tree resulting from an SPR. Given a node in this hypothetical tree, class methods should allow easy computation of:

The mapping between nodes in this hypothetical tree and the original MAT can be via the node IDs, like in this picture: image

The highlighted nodes, which are on a path from the source node to the target node, are exactly the nodes whose LeafSets will be changed by the SPR move, so the mapping between the two trees on those nodes doesn't fully make sense. However, for the nodes which aren't highlighted, the natural mapping via node IDs is the one we want. Also, there may not be a mapping for the parent of the target node (the red node in the picture).

Part Two:

Define a recursive function BuildFragment(node, parent fragment node) which takes a node in the hypothetical tree resulting from an SPR move, and the tree fragment node built for the node's parent.

Definition of BuildFragment(node, parent fragment node):

* How to decide if a node is an anchor?

Besides the anchor node which is a tree fragment's root (which we determine below when deciding which node to start BuildFragment on) a node is an anchor node (meaning it's guaranteed to be the same as a node in the DAG, and there are no changes in the tree below that node resulting from the SPR move) if the following are all true:

** Which sites do we need to iterate through?

Normally, we need only iterate through the sites at which either the base on the parent node changed (relative to the parent node's CG before the SPR move), or at which there's a Fitch set change. However, there are exceptions:

Bringing it together:

Now given a (unmodified) MAT being optimized, and a proposed SPR move on that MAT, the corresponding tree fragment should be what we get from BuildFragment(earliest changed node, parent history node).

Here earliest changed node is the earliest node in the SPR move's hypothetical tree which has changed major alleles, or the latest common ancestor of the source and target nodes of the SPR, whichever is closer to the tree root. parent history node is the history DAG node corresponding to the parent of earliest changed node. If earliest changed node is the root of the tree, then parent history node is the universal ancestor node.

Finally, we can merge the resulting fragment into the DAG using #43.

Things that need to be clearer here:

yceh commented 1 year ago

By the way, the boundary allele set is not used in the matOptimize implementation of the Fitch algorithm. It is just one by-product useful for quickly calculating the parsimony score change from a single move.

$C_n$ is simply the alleles that happen most frequently in the $C_n$s of the children nodes.

yceh commented 1 year ago

One suggestion about applying Fitch set changes: I assume in Larch, multiple alleles in the Fitch set are represented as different clade nodes with the same split. For example, on the source side, a new split will be created corresponding to the parent split of the removed source node ( which may have multiple clade nodes if multiple major alleles are present) (Otherwise, if such a split already exists, then we can stop on the source side). Then, when deciding who are the incoming clades of the new split, we can look at the incoming clades of the original split. If none of their state at every site are in the removed allele set, connect the new split to it. Then, look at the added allele set, and create new clade nodes using the clade node where the parent node is sampled from as a template.

yceh commented 1 year ago

matOptimize represents Fitch set as mutations internally and is not exported, but I will write about its omission conditions anyway.

willdumm commented 1 year ago

Thanks Cheng for the really helpful meeting, and your ideas.

Here's a summary of what we talked about (thanks Mary for the first draft of these)

  1. Clarifying the SPR move definition, the source node is added as a sibling of the target node (by adding a new parent along the branch above the target). However, if it's equally parsimonious to collapse the branch between the new parent of the target and the target node, we end up with the source as a new child of the target node. This decision is made after move application, however, so at the time that our move callback is called, the scenario that's considered is that source and target are siblings.
  2. Fitch set storage is analogous to our compact genome storage, and the fitch set for a site can be omitted if the conditions in Cheng's last comment are satisfied.
  3. These compactly represented Fitch sets are computed in a preprocessing step when a MAT is initially loaded.
  4. The fitch set differences which are supplied to the callback with an SPR move in nodes_with_major_allele_set_change are computed in a single postorder pass, and are stored as set differences. Since only a single node is added/removed from branches of the tree, fitch sets may differ by a maximum of one base, so that base is all that needs to be stored.
  5. When matOptimize chooses bases (downward Fitch) it chooses bases only according to the Fitch set (major allele set), and does not consider the boundary allele set at all. This is why the chosen base must be found in the majority of the child nodes' fitch sets. That is, the normal conditions for choosing a base:
    • the parent base, if it’s in the fitch set
    • the parent base, or a base from the fitch set, if the parent base is a boundary allele
    • a base from the fitch set, if the parent base isn’t in the boundary set. are simplified to:
    • the parent base, if its in the fitch set
    • otherwise, some base from the fitch set.
willdumm commented 1 year ago

One suggestion about applying Fitch set changes: I assume in Larch, multiple alleles in the Fitch set are represented as different clade nodes with the same split.

This is true, but not all alternative alleles are represented, only whichever ones happen to have been chosen in a tree that was merged into the DAG. I think this could make your next suggestion (below) even simpler.

For example, on the source side, a new split will be created corresponding to the parent split of the removed source node ( which may have multiple clade nodes if multiple major alleles are present) (Otherwise, if such a split already exists, then we can stop on the source side). Then, when deciding who are the incoming clades of the new split, we can look at the incoming clades of the original split. If none of their state at every site are in the removed allele set, connect the new split to it. Then, look at the added allele set, and create new clade nodes using the clade node where the parent node is sampled from as a template.

I think this is somewhat equivalent to what we're suggesting, although possibly simpler. I'll defer to @ognian- on whether we'd like to replace the tree fragment merging scheme with a more direct modification of the DAG. Either way, I'll reframe this in the context of an implementation of BuildFragment (updated above in the description of BuildFragment)

willdumm commented 1 year ago

Cheng provided some additional clarification about how fitch set changes are represented for an SPR move, in nodes_with_changed_major_alleles set. Some of the following is directly copied from our conversation, and some is edited slightly.

For example, with this SPR with source node 7 and target node 9: image (9)

there is a fitch set change on node 10, and on the new node which is colored red and has the new label 11.

Where the incremented_allele and decremented_allele values are one-hot-encoded subsets of {A,G,C,T}.

The fix to properly present new node fitch sets as changes to the target node is here: https://github.com/yceh/usher/commit/793036ec630f2fb65c05a0ae5b3d3d7c2df117c7

Finally, in the case when the source node has only one sibling, its parent remains in the tree after the SPR move despite having only one child. We will of course want to change this to avoid unifurcation when building an SPR move's fragment.