Closed willdumm closed 2 years ago
It turns out that with all the competing logic about node abundances, collapsed tree validation, etc, this is not as straightforward as I thought. I should be able to fix this properly as soon as I get back.
This PR abandons any attempt to keep track of unobserved sequence names in the history DAG (they're there, but ignored).
The reason this is difficult is that ambiguous sequences are disambiguated in the DAG, and each disambiguated node that's created is given the same name as the original. This means that there are lots of different sequences associated with the same unobserved node name.
You'd think we could just rename all dag nodes after disambiguation, but this turns out to be tricky because the dag has some uncollapsed leaf-adjacent edges, and because of the logic which adds/removes a dummy leaf for unobserved, unifurcating root.
Because of all that, it's simpler to just fix the unobserved node names when collapsed trees are created. Note that this means unobserved node names will change when initializing a CollapsedTree.
Here are the guarantees:
This is awesome, thank you!
This PR is intended to fix an issue where multiple internal sequences may have the same node name.
It also adds a test in
CollapsedTree._clade_tree_to_ctree
which verifies that for any collapsed tree extracted from the history DAG, the maps from nodes to sequences and sequences to names are bijections.