matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

Simplify abundance logic and remove tuple sequence names #89

Closed willdumm closed 2 years ago

willdumm commented 2 years ago

From the other PR for this issue, #88, which this one replaces: This is a simple fix for an issue that I've fixed before, but which reappeared after some recent changes.

For future reference, some documentation for how trees are processed to allow use of the history DAG:

The history DAG does not permit unifurcations in the trees it expresses. The original version also required that all expressed trees contain the same set of leaf sequences (this requirement has since been relaxed, but we'll take advantage of that in future gctree changes, not in this PR).

Since the dnapars trees have a constrained root sequence, they often have a root unifurcation. If any dnapars tree has a root unifurcation, then a pseudo-leaf node is added to every tree, as a child of its root node. This pseudo-leaf has name '', the same sequence as the root node (the naive sequence), and the same abundance as the root node (or zero, if the naive sequence is not observed).
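As a rough illustration of the pseudo-leaf step, here is a minimal sketch on a toy tree structure. The `Node` class and `add_pseudo_leaves` function are hypothetical names made up for this example; they are not gctree's actual data structures or API, and the sketch assumes the root's abundance is already zero when the naive sequence is unobserved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal tree node; not gctree's actual class."""
    name: str
    sequence: str
    abundance: int = 0
    children: List["Node"] = field(default_factory=list)


def add_pseudo_leaves(roots: List[Node]) -> None:
    """If any tree has a root unifurcation, add a pseudo-leaf under every root."""
    if not any(len(root.children) == 1 for root in roots):
        return
    for root in roots:
        # The pseudo-leaf has the empty name and copies the root's sequence
        # and abundance (assumed to be 0 if the naive sequence is unobserved).
        root.children.append(
            Node(name="", sequence=root.sequence, abundance=root.abundance)
        )


# tree_a has a root unifurcation, tree_b does not; both get a pseudo-leaf.
tree_a = Node("naive", "AAAA", 0, [Node("seq1", "AAAT", 3)])
tree_b = Node("naive", "AAAA", 0, [Node("seq1", "AAAT", 3), Node("seq2", "AATA", 1)])
add_pseudo_leaves([tree_a, tree_b])
```

Adding the pseudo-leaf to every tree, rather than only to the unifurcating ones, keeps the leaf set identical across trees, which is what the original history DAG version required.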

Once the history DAG is created from these trees (and disambiguated, and made complete), it is collapsed, so that only leaf-adjacent edges may have parent and child nodes with exactly the same sequence. Each node is annotated with the abundance associated to its sequence.
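The collapsing rule can be sketched on a single tree (the real operation acts on the DAG, which this example does not attempt to reproduce). `Node`, `collapse`, and `annotate_abundances` are hypothetical names for illustration only, and defaulting unobserved sequences to abundance zero is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical minimal tree node; not the history DAG's node type."""
    name: str
    sequence: str
    abundance: int = 0
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node) -> None:
    """Contract edges whose endpoints share a sequence, unless the child is a leaf."""
    new_children = []
    for child in node.children:
        collapse(child)
        if child.children and child.sequence == node.sequence:
            # Internal child identical to its parent: splice in its children.
            new_children.extend(child.children)
        else:
            # Mutant children and leaves (even identical ones) are kept.
            new_children.append(child)
    node.children = new_children


def annotate_abundances(node: Node, abundances: Dict[str, int]) -> None:
    """Annotate every node with the abundance of its sequence (0 if unobserved)."""
    node.abundance = abundances.get(node.sequence, 0)
    for child in node.children:
        annotate_abundances(child, abundances)


inner = Node("inner", "AAAA", 0, [Node("seq1", "AAAT"), Node("seq2", "AAAA")])
root = Node("naive", "AAAA", 0, [inner, Node("seq3", "AATA")])
collapse(root)
annotate_abundances(root, {"AAAA": 2, "AAAT": 3, "AATA": 1})
```

After collapsing, the only edge in this toy tree that still joins identical sequences is the leaf-adjacent one from the root to `seq2`.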

Since the branching process log likelihood decomposes as a sum over the nodes of a collapsed tree, the log likelihood is computed in the DAG by summing, over all edges, the log likelihood contribution of each edge's child node. Uncollapsed leaf-adjacent edges are ignored (so any pseudo-leaves are ignored too). A (child) node's abundance is read from its annotation, and its number of mutant descendants is simply the number of child clades it possesses, not counting child clades whose descendant edges are uncollapsed leaf-adjacent edges.
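The bookkeeping described above can be sketched on a single collapsed tree. Everything here is hypothetical and for illustration: `Node`, `counted_nodes`, and `log_likelihood` are not gctree functions, `toy_ll` is a stand-in for the real per-node branching process term, and treating the root as the child of an implicit edge is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical minimal node; not gctree's or the history DAG's class."""
    name: str
    sequence: str
    abundance: int = 0
    children: List["Node"] = field(default_factory=list)


def is_uncollapsed_leaf_edge(parent: Node, child: Node) -> bool:
    """True for a leaf-adjacent edge whose endpoints share a sequence."""
    return not child.children and child.sequence == parent.sequence


def mutant_clades(node: Node) -> int:
    """Count child clades, skipping those below uncollapsed leaf-adjacent edges."""
    return sum(not is_uncollapsed_leaf_edge(node, c) for c in node.children)


def counted_nodes(root: Node) -> List[Tuple[int, int]]:
    """Return (abundance, mutant clade count) for each node that contributes."""
    # Assumption: the root counts as the child of an implicit edge.
    pairs = [(root.abundance, mutant_clades(root))]
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            if is_uncollapsed_leaf_edge(parent, child):
                continue  # ignored edge; this also skips pseudo-leaves
            pairs.append((child.abundance, mutant_clades(child)))
            stack.append(child)
    return pairs


def log_likelihood(root: Node, node_ll: Callable[[int, int], float]) -> float:
    """Sum the per-node contributions over all counted nodes."""
    return sum(node_ll(c, m) for c, m in counted_nodes(root))


def toy_ll(c: int, m: int) -> float:
    """Stand-in contribution, NOT the branching process likelihood."""
    return -(c + m)


inner = Node("inner", "AATA", 1, [Node("seq2", "AATA", 1), Node("seq3", "AATT", 4)])
root = Node("naive", "AAAA", 2, [Node("", "AAAA", 2), Node("seq1", "AAAT", 3), inner])
total = log_likelihood(root, toy_ll)
```

In this toy tree the pseudo-leaf `''` and the leaf `seq2` sit below uncollapsed leaf-adjacent edges, so neither contributes a term nor counts as a mutant clade of its parent.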

After trees are extracted from the history DAG, any leaves with name '' are deleted. There should be at most one such leaf per tree; if there is more than one, an error is raised. This should only be possible if one of the observed sequences also has name ''.
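A minimal sketch of this cleanup step on a toy tree follows. `Node` and `remove_pseudo_leaves` are hypothetical names, and the choice of `ValueError` is an assumption; the actual gctree code may raise a different exception type.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical minimal tree node; not gctree's actual class."""
    name: str
    sequence: str
    abundance: int = 0
    children: List["Node"] = field(default_factory=list)


def remove_pseudo_leaves(root: Node) -> int:
    """Delete leaves named ''; raise if a tree holds more than one of them."""
    found: List[Tuple[Node, Node]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            if not child.children and child.name == "":
                found.append((node, child))
            stack.append(child)
    if len(found) > 1:
        # Only expected when an observed sequence is itself named ''.
        raise ValueError(f"expected at most one leaf named '', found {len(found)}")
    for parent, leaf in found:
        # Filter by identity so nodes that merely compare equal are kept.
        parent.children = [c for c in parent.children if c is not leaf]
    return len(found)


tree = Node("naive", "AAAA", 0, [Node("", "AAAA", 0), Node("seq1", "AAAT", 3)])
removed = remove_pseudo_leaves(tree)
```

The sketch checks the count before deleting anything, so a tree that triggers the error is left unmodified.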