matsengrp / larch

Inference and manipulation of history DAGs
2 stars 2 forks source link

merge fragments after reassign_states call #47

Closed marybarker closed 11 months ago

marybarker commented 1 year ago

There is a slight inefficiency in the sample-optimize-merge loop due to MatOptimize's requirement that all edges have at least one mutation. At present, this is resolved in the following setup:

  1. sample a tree from the DAG called sample
  2. create a MAT copy tree
  3. update tree to satisfy MatOptimize's criterion by calling reassign_states
  4. merge the updated tree into the DAG to create a correct map between DAG node ids and MAT node ids
  5. optimized tree

It is important that the DAG contains the same nodes as the set of nodes in tree, and that the corresponding map between structures gives those node indices. This is the only way that the callback can correctly interpret proposed moves, and is therefore crucial to the accuracy of this routine.

The proposed speedup is in step 4. Instead of merging the entire updated tree into the DAG, we might want to recover the fragments of the tree containing new/altered nodes resulting from calling reassign_states.

The following plots show that the number of nodes that are actually changed in reassign_states calls is minimal, compared to the overall number of nodes in the sampled tree. Instead of rebuilding a map for the entire set of tree nodes at each iteration, we might potentially only need to map a small subset of nodes.

Starting with a tree containing >44,000 nodes, the sample tree and updated tree were saved at the beginning of each larch-usher iteration for 50 iterations. The proportion of tree nodes that were altered is extremely small, and the following set of plots explore some of the questions we can ask about how calling reassign_states changes the trees in iterated sample-optimize-merge calls.

Plot 1: does the number of altered nodes increase/decrease/matter?

percent_changed This plot shows that sampling parsimony-optimal subtrees results in trees that have slightly fewer altered nodes, although the difference between optimal subtrees and uniformly chosen subtrees is minimal.

Plot 2: how different topologically are the trees before and after?

rf_distance The RF distance between sample and updated tree is extremely small in both cases.

Plot 3: how many of the changed nodes are topologically different, and how many are just relabelled?

rf_dist_vs_num_changed The RF distance is based on the symmetric difference between the set of splits in each tree. Therefore it gives twice the number of topologically new nodes in the altered_tree.

Notes:

marybarker commented 11 months ago

this was addressed in issues #6 and issue #44