merge fragments after reassign_states call

There is a slight inefficiency in the sample-optimize-merge loop due to MatOptimize's requirement that all edges have at least one mutation. At present, this is resolved in the following setup:

sample a tree from the DAG called sample
create a MAT copy tree
update tree to satisfy MatOptimize's criterion by calling reassign_states
merge the updated tree into the DAG to create a correct map between DAG node ids and MAT node ids
optimized tree

It is important that the DAG contains the same nodes as the set of nodes in tree, and that the corresponding map between structures gives those node indices. This is the only way that the callback can correctly interpret proposed moves, and is therefore crucial to the accuracy of this routine.

The proposed speedup is in step 4. Instead of merging the entire updated tree into the DAG, we might want to recover the fragments of the tree containing new/altered nodes resulting from calling reassign_states.

The following plots show that the number of nodes that are actually changed in reassign_states calls is minimal, compared to the overall number of nodes in the sampled tree. Instead of rebuilding a map for the entire set of tree nodes at each iteration, we might potentially only need to map a small subset of nodes.

Starting with a tree containing >44,000 nodes, the sample tree and updated tree were saved at the beginning of each larch-usher iteration for 50 iterations. The proportion of tree nodes that were altered is extremely small, and the following set of plots explore some of the questions we can ask about how calling reassign_states changes the trees in iterated sample-optimize-merge calls.

Plot 1: does the number of altered nodes increase/decrease/matter?

percent_changed This plot shows that sampling parsimony-optimal subtrees results in trees that have slightly fewer altered nodes, although the difference between optimal subtrees and uniformly chosen subtrees is minimal.

Plot 2: how different topologically are the trees before and after?

rf_distance The RF distance between sample and updated tree is extremely small in both cases.

Plot 3: how many of the changed nodes are topologically different, and how many are just relabelled?

rf_dist_vs_num_changed The RF distance is based on the symmetric difference between the set of splits in each tree. Therefore it gives twice the number of topologically new nodes in the altered_tree.

Notes:

Since these cases were run on the same tree-shaped DAG, the initial sample is identical: the original tree. Therefore we should expect to see (and do see) the number of changed nodes to be identical for the first iteration.
The maximum hamming distance between any altered node and its relabelled counterpart is 1 for all iterations when sampling optimal subtrees, and the maximum is 2 for sampling subtrees uniformly.
The average value of total number of changed nodes divided by the total number of nodes in the tree is 0.approximately 0.001 in both cases.

matsengrp / larch