We would like to try sampling trees from the history DAG which are topologically strange, find the median trees in the DAG, find outliers, etc. Topological strangeness will be quantified as the sum of Robinson Foulds distance from a tree to any other tree in the DAG. We can compute this using the AddFuncDict framework.
We need first a map $N$ from any clade in the hDAG (that is, any set of taxa that can be realized as the union of a node's child clades) to the number of trees in the hDAG which contain a node with that clade. This is implemented by HistoryDag.count_nodes(collapse=True).
I show here starting at line 115 that for any tree in the hDAG, the sum of the Robinson Foulds distances from that tree $t$ to every other tree in the DAG (a sum which I'll call $D_t$) is given by
$$Dt = K + \sum{e\in t} (|T| - 2N(c_e))$$
where...
$K$ is an easily computable constant which we won't worry about (it's the sum of number of clades in each tree in the DAG),
the sum is over edges $e$ in the tree $t$,
$c_e$ is the clade of the child node of the edge $e$ in the tree $t$,
and $|T|$ is the number of trees in the hDAG.
Therefore, all we need for computing the (K-shifted) sum RF distance on all trees in the DAG via SubtreeWeight is an edge function which returns this value $|T| - 2N(c_e)$ for any edge $e$. Note that the sum, and therefore the value $D_t - K$ which we'll be maximizing in trees, will probably be negative.
We would like to try sampling trees from the history DAG which are topologically strange, find the median trees in the DAG, find outliers, etc. Topological strangeness will be quantified as the sum of Robinson Foulds distance from a tree to any other tree in the DAG. We can compute this using the
AddFuncDict
framework.HistoryDag.count_nodes(collapse=True)
.SubtreeWeight
is an edge function which returns this value $|T| - 2N(c_e)$ for any edge $e$. Note that the sum, and therefore the value $D_t - K$ which we'll be maximizing in trees, will probably be negative.AddFuncDict
(see this one for an example, yours will want to take areference_dag
historydag argument, which you'll use to compute $N$ and $|T|$) https://github.com/matsengrp/historydag/blob/a197c0231b6e6fcee7a6b83ba48bc6a4396c9340/historydag/utils.py#L423 we can implement some HistoryDag methods that make use of it, in the style of https://github.com/matsengrp/historydag/blob/a197c0231b6e6fcee7a6b83ba48bc6a4396c9340/historydag/dag.py#L1891 For example, maybe amedian_trees
method to trim the dag to trees which minimize sum RF distance (there may be more than one!)