matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

Rethink isotype inference #116

Open willdumm opened 1 year ago

willdumm commented 1 year ago

Currently, gctree optionally uses 'isotype parsimony', which is the number of isotype switching events required along a tree, given observed leaf isotypes, to help rank MP trees (alongside likelihood and mutability parsimony).

Isotypes are added to trees in the DAG by assigning each internal (unobserved) node the earliest isotype observed on any of the leaves below it. This is guaranteed to yield a labeling that uses only allowed isotype transitions and minimizes the number of isotype switching events on the tree. However, since isotype doesn't influence the tree topology, this method sometimes results in many isotype switching events that could be avoided by adding a few extra nodes. A scenario where this occurs is shown below. Although this behavior could be okay for ranking trees according to isotype parsimony, it's not great for understanding where isotype switching happens in the tree, because it seems likely that the real tree looks a bit different:

image
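To make the labeling rule concrete, here is a minimal sketch of it on a toy tree. It is not gctree's actual code: the `Node` class, the function names, and the three-isotype order are assumptions made up for illustration, and the toy trees are not the ones in the image. The sketch labels each internal node with the earliest isotype among the leaves below it, then counts the edges on which the isotype changes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed, simplified switching order (earlier index = earlier isotype).
ISOTYPE_ORDER = ["IgM", "IgG", "IgA"]
RANK = {name: i for i, name in enumerate(ISOTYPE_ORDER)}


@dataclass
class Node:
    """Hypothetical tree node: leaves carry an observed isotype, internal nodes start unlabeled."""
    isotype: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def label_isotypes(node: Node) -> str:
    """Give each internal node the earliest isotype found on the leaves below it."""
    if not node.children:
        return node.isotype
    node.isotype = min(
        (label_isotypes(child) for child in node.children),
        key=RANK.__getitem__,
    )
    return node.isotype


def count_switches(node: Node) -> int:
    """Count edges whose parent and child isotypes differ (the isotype parsimony score)."""
    return sum(
        (child.isotype != node.isotype) + count_switches(child)
        for child in node.children
    )


# Unresolved multifurcation: one IgM leaf and two IgG leaves under the root.
flat = Node(children=[Node("IgM"), Node("IgG"), Node("IgG")])
# Same leaves, with one extra internal node grouping the two IgG leaves.
resolved = Node(children=[Node("IgM"), Node(children=[Node("IgG"), Node("IgG")])])

for tree in (flat, resolved):
    label_isotypes(tree)

print(count_switches(flat))      # 2
print(count_switches(resolved))  # 1
```

In this toy case the extra internal node turns two independent switching events into one, which is the kind of saving the paragraph above describes.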

To understand where class switching really happens in the tree, we propose doing something slightly different. When provided with isotype data, we:

From the example above, we would be considering both the tree on the right and the following one, which benefits from a partial resolution of the multifurcation using isotypes. (If isotypes weren't provided, we would only consider the tree on the left in the above image for ranking.)

image

This tree seems more plausible, because it places the high-abundance node above more children in the [isotype, sequence]-collapsed tree, and because it has only one isotype switching event.

Here's a more detailed example (with a different starting MP tree) showing the steps in the list above. Here, edges without mutations are marked with a slash, and inferred ancestral isotypes are enclosed in parentheses.

image

Implementation Details:

At some point (#91), I spent a while making gctree work with ambiguous input sequences. This was quite difficult, because different MP trees have different resolved tip sequences, resulting in different collapsed abundances between trees. To make it possible, there's a somewhat complicated scheme involving abundances as part of hDAG node labels, along with involved pre-processing of MP trees from dnapars.
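As a rough illustration of why ambiguity complicates abundances, consider the sketch below. The sequences and the two resolutions are invented for this example and do not come from gctree or dnapars; the point is only that the same observed tips can collapse to different genotype abundances depending on how each tree resolves an ambiguous base.

```python
from collections import Counter

# Hypothetical observed tip sequences; "N" marks an ambiguous base.
observed_tips = ["ACA", "ACA", "ACN", "ACG"]

# Assumed resolutions of the ambiguous tip in two different MP trees.
tree_1_tips = ["ACA", "ACA", "ACA", "ACG"]  # "ACN" resolved to "ACA"
tree_2_tips = ["ACA", "ACA", "ACG", "ACG"]  # "ACN" resolved to "ACG"

# Collapsing identical sequences gives each genotype's abundance.
abundances_1 = Counter(tree_1_tips)
abundances_2 = Counter(tree_2_tips)

print(abundances_1["ACA"], abundances_1["ACG"])  # 3 1
print(abundances_2["ACA"], abundances_2["ACG"])  # 2 2
```

Because the abundance of a genotype is no longer a fixed property of the input data, it has to be tracked per tree, which is the bookkeeping the paragraph above refers to.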

To make these proposed isotyping changes easier, I'd like to revert those changes, making gctree only work with fully-resolved sequences again. Anyone who needs ambiguous-sequence support could still use one of the versions that has it, but it doesn't seem to be a feature that core gctree users need, and dropping it would make the code much cleaner and easier to modify in the future, including for the changes proposed in this issue.

Whether inference is provided with isotype data will affect at least the following: