matsengrp / larch

Inference and manipulation of history DAGs
MIT License
2 stars 2 forks source link

Apply individual matOptimize moves to the DAG #6

Closed matsen closed 1 year ago

matsen commented 2 years ago

This issue follows after #5 .

In the long run we will want to have some "unit of tree modification" that isn't a whole tree. That is, if matOptimize finds an additional good tree modification that isn't in the DAG already, how can we communicate about that tree modification other than merging in the entire new tree (most of which will be totally redundant)?

@davidrich27 has been doing great work on encoding NNI moves by pairs of nodes, encoded by their subsplits. It seems like we could have something like that here.

Taking a look at the matOptimize code, they certainly have ways of describing moves and then applying them. For example, individual_move appears to be the way in which moves get applied.

Could we have a little hook in there that sees that a new move is being applied and then applies it to the DAG? It would seem best in this case to have a means of going from the current tree to its embedding in the DAG, which seems like some non-trivial bookkeeping but not impossible.

yceh commented 2 years ago

Sorry, individual_move is for testing whether a single move is profitable, used in recycling conflicting moves. The actual function that applies move is apply_move. It apply a batch of moves to armotize the cost of recalculating Fitch/Boundary allele sets. We don't have specialization of this function to apply a single move.

willdumm commented 1 year ago

Here's Cheng's response when asked if we can get recomputed fitch and boundary allele sets with each move:

MatOptimize actually doesn't compute boundary allele set change during the enumeration phase, it only computes Fitch set changes. The set of nodes with Fitch set changed is passed in node_with_major_allele_set_change argument at https://github.com/yceh/usher/blob/8257c0bf643014707d99b3101557cf6125e7ba71/src/ma[…]e/Profitable_Moves_Enumerators/Profitable_Moves_Enumerators.hpp.

So the change in the Fitch set, as well as the original Fitch set, can be accessed via the Node_With_Major_Allele_Set_Change objects in the set node_with_major_allele_set_change, which is passed to the callback.

From this I suppose we could choose an acceptable re-labeling of those new nodes in the DAG (we'd have to re-compute boundary alleles, but that should be easy), if we think this will still be faster than doing our own thing.

cc: @marybarker @matsen @ognian-

willdumm commented 1 year ago

A good first step for implementing this issue will probably be #43 ^^

marybarker commented 1 year ago

Here is an attempt at summary:

SPR moves are done on a tree sampled from the DAG. Given a proposed move, we have the following information, accessible through the callback:

This involves creating a new DAG node for each node in this path, so that we capture both the updated clades below each node along the path, and the new labeling from the Fitch-Sankoff algorithm (which is stored in the node_with_major_allele_set_change). It is worth noting that it is not only nodes that are ancestral to src and dst that might have altered sequences to optimize the parsimony score after a move. In fact, all of the nodes that are a direct child of a node along the path between src and dst may have altered sequences.

New clade sets

See this issue for a discussion of the new nodes that are added (in terms of clades and connectivity). Each new node corresponds to a node that is in the tree.

New node labels

The node_with_major_allele_set_change type contains the information: (nodeId, old Fitch set, new Fitch set) and so we can recover the altered Fitch sets.

Translating Fitch sets into compact genomes

This is another step that merits a separate issue.

Merging changes directly

This is written more clearly in this issue. Since we want to reconstruct all of the new/changed structures in the new tree without actually merging it into the DAG, we will need a way of mergin a tree-shaped substructure (not necessarily a whole subtree, which would include leaf nodes) into the DAG.