matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0
16 stars 2 forks source link

Fix multiple tree nodes with same name #86

Closed willdumm closed 2 years ago

willdumm commented 2 years ago

This PR is intended to fix an issue where multiple internal sequences may have the same node name.

It also adds a test in CollapsedTree._clade_tree_to_ctree which verifies that for any collapsed tree extracted from the history DAG, the maps from nodes to sequences and sequences to names are bijections.

willdumm commented 2 years ago

It turns out that with all the competing logic about node abundances, collapsed tree validation, etc, this is not as straightforward as I thought. I should be able to fix this properly as soon as I get back.

willdumm commented 2 years ago

This PR abandons any attempt to keep track of unobserved sequence names in the history DAG (they're there, but ignored).

The reason this is difficult is that ambiguous sequences are disambiguated in the DAG, and each disambiguated node that's created is given the same name as the original. This means that there are lots of different sequences associated with the same unobserved node name.

You'd think we could just rename all dag nodes after disambiguation, but this turns out to be tricky because the dag has some uncollapsed leaf-adjacent edges, and because of the logic which adds/removes a dummy leaf for unobserved, unifurcating root.

Because of all that, it's simpler to just fix the unobserved node names when collapsed trees are created. Note that this means unobserved node names will change when initializing a CollapsedTree.

Here are the guarantees:

psathyrella commented 2 years ago

This is awesome, thank you!