Calculate leaf sequences from Fitch sets

This is a beginning step toward implementing #6

Since the callback has access to the altered/updated nodes for a given move, we can fully resolve these nodes without using the entire tree if we use the Fitch set changes that the move provides access to through the node_with_major_allele_set_change vector.

The Fitch algorithm is an alternative to the Sankoff algorithm, where Fitch sets are assigned to nodes instead of cost vectors. Each site in the sequence for node n is assigned a Fitch set: a set containing the possible choices of base that minimize the subtree cost below that node.

Fitch Algorithm

(for simplicity, written for a length-1 sequence, but generalizes to length-k sequence if we allow a Fitch set for each site in the sequence)

Notation: given a node n

$C_n$ is the set of children of n
$F_n$ is the Fitch set associated to n
$B_n$ is the Boundary set of n

Then we assign Fitch sets to the nodes in a postorder traversal as follows:

For a leaf node n, the Fitch set is the singleton set that contains n
For a non-leaf node n:
- if the intersection $\cap_{c\in C_n} F_c$ is nonempty, then $$Fn := \bigcap{c\in C_n}F_c$$ and $Bn$ is the set of elements in $\bigcup{c\in C_n}$ that occur with highest frequency.
- if this intersection $\bigcap_{c\in C_n} F_c=\emptyset$, then $Fn$ is the set of most common elements of $\bigcup{c\in C_n}F_c$, and set $B_n$ to be the set of next-most-common elements.

Using the Fitch sets and boundary allele sets, the sequences are assigned in a preorder traversal:

If the parent p of n has base that is in $F_n$ then set the base of n equal to that base.
Else, if the parent's base is in $B_n$, assign the base of n to be that base.
Otherwise, choose the base of n to be an element of $F_n$.

EDIT 1/13/23: Mary and I talked through lots of details, and I've attempted to update everything to reflect the details that we worked out. We think we've figured everything out, so hopefully this is a good starting point for an implementation.

EDIT 1/11/23: I've made various changes based on our conversation with Cheng, but I may have missed things, and anyone can feel free to make more corrections if you see them.

So we need a way to take an SPR move and build a tree fragment (like those in #43) summarizing all the changes (to CG's) that result from applying that SPR move to the original tree (sampled from the DAG).

Here's a description of how to do this in different words than Mary's that's still kind of vague:

Part One:

Although we don't want to actually apply the SPR move to a copy of the original tree, we could use a class storing references to the original tree and the SPR move data, to encapsulate the logic involved in getting node neighbors and other node data in the hypothetical tree resulting from applying that SPR move. I'll call this the hypothetical tree resulting from an SPR. Given a node in this hypothetical tree, class methods should allow easy computation of:

the node's fitch set at each site. For some sites, the fitch set is omitted, but this only happens when the fitch set is a singleton, and the parent fitch set is the same. In this case, the base at that site in the compact genome that we've chosen for the parent node (in the BuildFragment routine) is the base in this singleton set, as long as there are no changes to the fitch set at this site. If there are changes, I think we need to traverse up the tree to the last node that has a recorded fitch set at this site, to figure out what it is. See this comment for details about how the new node's fitch set changes are encoded https://github.com/matsengrp/larch/issues/44#issuecomment-1381021310
the original compact genome for each node, before the SPR move. For the new node (red node in the picture) we will use the target node's compact genome.
the node's children. This is straightforward, except for when the parent of the source node only has one child after the SPR move. In this case, it may be convenient to just skip this node. That is, using the node IDs in this picture, the children of node 4 in the hypothetical tree should not be 3 and 5, but rather 1 and 5, and the parent of node 1 should be 4 rather than 3. This should not cause any problems because the fitch sets of node 1 and 3 after the SPR move are guaranteed to be the same.
the node's LeafSet (using a map from DAG nodes to unmodified MAT nodes) ... all as they would be if we really had applied the SPR move to the original MAT.

The mapping between nodes in this hypothetical tree and the original MAT can be via the node IDs, like in this picture:

The highlighted nodes, which are on a path from the source node to the target node, are exactly the nodes whose LeafSets will be changed by the SPR move, so the mapping between the two trees on those nodes doesn't fully make sense. However, for the nodes which aren't highlighted, the natural mapping via node IDs is the one we want. Also, there may not be a mapping for the parent of the target node (the red node in the picture).

Part Two:

Define a recursive function BuildFragment(node, parent fragment node) which takes a node in the hypothetical tree resulting from an SPR move, and the tree fragment node built for the node's parent.

Definition of BuildFragment(node, parent fragment node):

Let chosen CG be the compact genome for node, before the SPR move. If node is the new node (red node in the picture above) then use the target node's old CG.
For each site (**see below for explanation):
- Get Fitch set for site at node
- Choose a single base for node at site:
  - If parent fragment node is not the UA node, let parent base be the base in parent fragment node's (new chosen) compact genome at site, and if parent base is in Fitch set, then let chosen base be parent base.
  - otherwise, let chosen base be the base at site on corresponding node in unmodified tree, if that base is in the fitch set, otherwise choose a base randomly from fitch set.
  - adjust chosen CG to have chosen base at site and record that a change was made to chosen CG at that site, if necessary. (Which sites required changes will affect which sites must be considered when computing the compact genomes of any children of this node (**)).
- Create a tree fragment node v' with the same leaf set as node, and containing chosen CG, and add v' as a child of parent fragment node.
- If v' is not an anchor node (*see below for explanation):
  - for each child c of node, do BuildFragment(c, v').

* How to decide if a node is an anchor?

Besides the anchor node which is a tree fragment's root (which we determine below when deciding which node to start BuildFragment on) a node is an anchor node (meaning it's guaranteed to be the same as a node in the DAG, and there are no changes in the tree below that node resulting from the SPR move) if the following are all true:

Its clade (unioned LeafSet) is a subset of, or disjoint from, the source or target node's clade (meaning there's guaranteed to be a well-defined map between nodes before and after the SPR move, and implying that the LeafSet hasn't changed from the corresponding node in the sampled tree)
The compact genome chosen for the node is the same as the compact genome on the corresponding node in the sampled tree. This is guaranteed to occur for leaves, but should usually happen before traversing all the way to leaves.

** Which sites do we need to iterate through?

Normally, we need only iterate through the sites at which either the base on the parent node changed (relative to the parent node's CG before the SPR move), or at which there's a Fitch set change. However, there are exceptions:

if the parent fragment node is the UA node, we just iterate through all sites at which there's a fitch set change on node.
(very special case) if node is the source node, then none of its fitch sets have changed. We used the target node's CG as the template for choosing the CG of the parent fragment node which is the fragment node resulting from the "new node" (red node in the picture). However, the old parent of the source node may have had a very different CG. Therefore, we should iterate through all the sites at which the new CG for parent fragment node differs from the old CG of the source node's old parent (node 14 in the picture).
(optimization) If we're on separate branches from, or below, the source and target nodes, then fitch sets are guaranteed to have not changed, so we only need to consider sites at which the parent node's CG was changed.

Bringing it together:

Now given a (unmodified) MAT being optimized, and a proposed SPR move on that MAT, the corresponding tree fragment should be what we get from BuildFragment(earliest changed node, parent history node).

Here earliest changed node is the earliest node in the SPR move's hypothetical tree which has changed major alleles, or the latest common ancestor of the source and target nodes of the SPR, whichever is closer to the tree root. parent history node is the history DAG node corresponding to the parent of earliest changed node. If earliest changed node is the root of the tree, then parent history node is the universal ancestor node.

Finally, we can merge the resulting fragment into the DAG using #43.

Things that need to be clearer here:

We're going to ignore this for now, but we're assuming that the compact genome for parent history node is part of a maximally parsimonious labeling of the sampled tree's topology. This may not be true in an hDAG, but is more likely to be true when the sampled tree achieves the maximum parsimony score seen in the DAG.
...

By the way, the boundary allele set is not used in the matOptimize implementation of the Fitch algorithm. It is just one by-product useful for quickly calculating the parsimony score change from a single move.

$C_n$ is simply the alleles that happen most frequently in the $C_n$s of the children nodes.

One suggestion about applying Fitch set changes: I assume in Larch, multiple alleles in the Fitch set are represented as different clade nodes with the same split. For example, on the source side, a new split will be created corresponding to the parent split of the removed source node ( which may have multiple clade nodes if multiple major alleles are present) (Otherwise, if such a split already exists, then we can stop on the source side). Then, when deciding who are the incoming clades of the new split, we can look at the incoming clades of the original split. If none of their state at every site are in the removed allele set, connect the new split to it. Then, look at the added allele set, and create new clade nodes using the clade node where the parent node is sampled from as a template.

matOptimize represents Fitch set as mutations internally and is not exported, but I will write about its omission conditions anyway.

Fitch set is a singleton that is the same as the tie-broken parent allele. (Omission is done after the downward pass)
Boundary allele set is empty.

Thanks Cheng for the really helpful meeting, and your ideas.

Here's a summary of what we talked about (thanks Mary for the first draft of these)

Clarifying the SPR move definition, the source node is added as a sibling of the target node (by adding a new parent along the branch above the target). However, if it's equally parsimonious to collapse the branch between the new parent of the target and the target node, we end up with the source as a new child of the target node. This decision is made after move application, however, so at the time that our move callback is called, the scenario that's considered is that source and target are siblings.
Fitch set storage is analogous to our compact genome storage, and the fitch set for a site can be omitted if the conditions in Cheng's last comment are satisfied.
These compactly represented Fitch sets are computed in a preprocessing step when a MAT is initially loaded.
The fitch set differences which are supplied to the callback with an SPR move in nodes_with_major_allele_set_change are computed in a single postorder pass, and are stored as set differences. Since only a single node is added/removed from branches of the tree, fitch sets may differ by a maximum of one base, so that base is all that needs to be stored.
When matOptimize chooses bases (downward Fitch) it chooses bases only according to the Fitch set (major allele set), and does not consider the boundary allele set at all. This is why the chosen base must be found in the majority of the child nodes' fitch sets. That is, the normal conditions for choosing a base:
- the parent base, if it’s in the fitch set
- the parent base, or a base from the fitch set, if the parent base is a boundary allele
- a base from the fitch set, if the parent base isn’t in the boundary set. are simplified to:
- the parent base, if its in the fitch set
- otherwise, some base from the fitch set.

One suggestion about applying Fitch set changes: I assume in Larch, multiple alleles in the Fitch set are represented as different clade nodes with the same split.

This is true, but not all alternative alleles are represented, only whichever ones happen to have been chosen in a tree that was merged into the DAG. I think this could make your next suggestion (below) even simpler.

For example, on the source side, a new split will be created corresponding to the parent split of the removed source node ( which may have multiple clade nodes if multiple major alleles are present) (Otherwise, if such a split already exists, then we can stop on the source side). Then, when deciding who are the incoming clades of the new split, we can look at the incoming clades of the original split. If none of their state at every site are in the removed allele set, connect the new split to it. Then, look at the added allele set, and create new clade nodes using the clade node where the parent node is sampled from as a template.

I think this is somewhat equivalent to what we're suggesting, although possibly simpler. I'll defer to @ognian- on whether we'd like to replace the tree fragment merging scheme with a more direct modification of the DAG. Either way, I'll reframe this in the context of an implementation of BuildFragment (updated above in the description of BuildFragment)

Cheng provided some additional clarification about how fitch set changes are represented for an SPR move, in nodes_with_changed_major_alleles set. Some of the following is directly copied from our conversation, and some is edited slightly.

For example, with this SPR with source node 7 and target node 9: image (9)

there is a fitch set change on node 10, and on the new node which is colored red and has the new label 11.

For 10, the corresponding Mutation count change object will have decremented_allele set to 10 (one hot encoded C and T), and incremented_allele set to 0.
The new node's fitch set will be recorded in nodes_with_changed_major_alleles set as a change relative to the fitch set of the target node. Since the target node's fitch sets cannot change after an SPR move, we know that this entry is really for the new node. In this example, it would look like this:
```
Node_With_Major_Allele_Set_Change{
node=9,
major_allele_set_change=
    ......
    Mutation_Count_Change{
          .....
          incremented_allele=5,
          decremented_allele=0
}
```

Where the incremented_allele and decremented_allele values are one-hot-encoded subsets of {A,G,C,T}.

The fix to properly present new node fitch sets as changes to the target node is here: https://github.com/yceh/usher/commit/793036ec630f2fb65c05a0ae5b3d3d7c2df117c7

Finally, in the case when the source node has only one sibling, its parent remains in the tree after the SPR move despite having only one child. We will of course want to change this to avoid unifurcation when building an SPR move's fragment.

matsengrp / larch