matsengrp / historydag

https://matsengrp.github.io/historydag
GNU General Public License v3.0
0 stars 1 forks source link

Support for leaf labels with ambiguities #38

Open marybarker opened 1 year ago

marybarker commented 1 year ago

It seems likely that we can accommodate ambiguous bases in leaf labels using the mutation annotated dag/compact genome DAG structure, and do so without losing the parsimony criterion, as long as the internal node labels are fully unambiguous.

If the internal node labels (in compact genome format) are fully disambiguated, then for any tree sampled from the DAG, its parsimony score can be computed explicitly from the internal nodes, and minimized over the resolutions of ambiguous leaf labels.

We need to explicitly prove the ansatz that resolutions of leaf labels can be performed independently of each other, and here is a place to start that conversation.

willdumm commented 1 year ago

Let $h'$ be the minimum possible hamming distance on an edge (between two nodes with ambiguous sequences). Consider a history $t = (V_t, Et)$ whose leaves have ambiguous sequence labels, but whose internal node sequence labels are unambiguous. We need to show that $$\sum{e\in Et} h'(e) = \min{t' \text{ a resolution of t }} \sum{e\in E{t'}} h(e)$$

However, in such a tree $t$, all nodes with ambiguous sequences have degree 1, and their unique parent has an unambiguous sequence. To minimize the sum on the right is to choose leaf sequences which minimizes $h$ on each pendant edge. Since each such choice of sequence changes the value of $h$ on any other edge, we have the equality above.

This $h'$ then works as an edge-weight function in the hDAG, with all the properties of normal hamming distance on an unambiguous dag, as long as all internal nodes in the DAG ( and therefore all internal nodes in any tree in the DAG) have unambiguous sequences.

willdumm commented 1 year ago

In historydag this will be provided by a subclass of HistoryDag expecting each node label to contain a leaf_ambiguous_sequence field, or something similar, which will be expected to contain no ambiguities unless it is on a leaf node.

We'll need a version of hamming distance which returns the minimum distance possible between two nodes.

Not sure if we'd ever want it, but we could also have a subclass that includes ambiguous sequences at every node, which would allow e.g. min-weight-ambiguous labeled trees to be used to create a DAG... I guess we need to think about how these things are related

In Larch, we need a way of keeping track of leaf IDs, so we can correctly replace the resolved sequences chosen for leaves by matOptimize with the original ambiguous sequences before merging back to the DAG. We also want to provide ambiguous sequences (in the internal representation for vcf data of usher) to matOptimize when optimizing a tree.