Rethink isotype inference

Currently, gctree optionally uses 'isotype parsimony', which is the number of isotype switching events required along a tree, given observed leaf isotypes, to help rank MP trees (alongside likelihood and mutability parsimony).

Isotypes are added to trees in the DAG by assigning each internal (unobserved) node the earliest isotype observed on any of the leaves below it. This is guaranteed to yield a labeling that uses only allowed isotype transitions and minimizes the number of isotype switching events on the tree. However, since isotype doesn't influence the tree topology, this method sometimes produces many isotype switching events that could be avoided by adding a few extra nodes. A scenario where this occurs is shown below. Although this behavior may be fine for ranking trees by isotype parsimony, it's not great for understanding where isotype switching happens in the tree, because it seems likely that the real tree looks a bit different.
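To make the labeling rule concrete, here is a minimal sketch. It is not gctree's actual code: the `Node` class, the truncated isotype order, and the choice to count one switching event per edge whose endpoints differ are all assumptions made for illustration. It also shows how grouping two leaves under one extra node lowers the count.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Truncated, illustrative isotype order (an assumption for this sketch);
# class switching may only move from left to right.
ISOTYPE_ORDER = ["IgM", "IgG3", "IgG1", "IgE"]
RANK = {isotype: i for i, isotype in enumerate(ISOTYPE_ORDER)}


@dataclass
class Node:
    """Hypothetical tree node: leaves carry an observed isotype, internal nodes start unlabeled."""

    name: str
    isotype: str | None = None
    children: list[Node] = field(default_factory=list)


def label_isotypes(node: Node) -> str:
    """Give each internal node the earliest isotype found among the leaves below it."""
    if node.children:
        child_isotypes = [label_isotypes(child) for child in node.children]
        node.isotype = min(child_isotypes, key=RANK.__getitem__)
    return node.isotype


def isotype_parsimony(node: Node) -> int:
    """Count edges whose parent and child isotypes differ."""
    total = 0
    for child in node.children:
        if child.isotype != node.isotype:
            total += 1
        total += isotype_parsimony(child)
    return total


# A multifurcation with one IgM leaf and two IgG1 leaves.
flat = Node("root", children=[Node("a", "IgM"), Node("b", "IgG1"), Node("c", "IgG1")])
label_isotypes(flat)

# The same leaves, with one extra node grouping the two IgG1 leaves.
resolved = Node(
    "root",
    children=[
        Node("a", "IgM"),
        Node("x", children=[Node("b", "IgG1"), Node("c", "IgG1")]),
    ],
)
label_isotypes(resolved)

flat_switches = isotype_parsimony(flat)
resolved_switches = isotype_parsimony(resolved)
```

In this toy case the flat tree needs two switching events while the tree with the extra node needs one, which is the kind of avoidable switching described above.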

To understand where class switching really happens in the tree, we propose a slightly different approach. When isotype data is provided, we:

pre-process MP trees from dnapars by splitting pendant branches whose leaf nodes represent multiple observed isotypes
add inferred ancestral isotypes to internal nodes
partially resolve multifurcations when at least one child of the multifurcating node has the same sequence as its parent but a different isotype (this will be done independently on different subtrees using hDAG infrastructure), while also keeping the original multifurcating structure as an alternative
collapse with respect to both isotype and sequence, and fit branching process parameters on this notion of collapsed tree (so the trees we get when doing inference with isotype data are fundamentally different structures than those we get from gctree when not providing isotype data)
rank trees w.r.t. likelihood (and possibly mutability parsimony), but no longer use isotype parsimony in ranking.
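The first and fourth steps can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class and the `split_by_isotype` and `collapse` helpers are hypothetical names, not gctree functions, and the ancestral isotype on the parent is set by hand rather than inferred.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node labeled by sequence and isotype, with an observed abundance."""

    sequence: str
    isotype: str | None = None
    abundance: int = 0
    children: list[Node] = field(default_factory=list)


def split_by_isotype(sequence: str, isotype_counts: dict[str, int]) -> list[Node]:
    """Replace one observed leaf by one leaf per isotype observed for its sequence."""
    return [Node(sequence, isotype, count) for isotype, count in sorted(isotype_counts.items())]


def collapse(node: Node) -> Node:
    """Merge a child into its parent only when both sequence and isotype match."""
    kept = []
    for child in node.children:
        collapse(child)
        if (child.sequence, child.isotype) == (node.sequence, node.isotype):
            # Same [isotype, sequence] label: absorb the child's abundance and children.
            node.abundance += child.abundance
            kept.extend(child.children)
        else:
            kept.append(child)
    node.children = kept
    return node


# A leaf observed 3 times as IgM and 2 times as IgG1, below a parent with the
# same sequence whose ancestral isotype is taken to be IgM.
leaves = split_by_isotype("AAC", {"IgM": 3, "IgG1": 2})
parent = Node("AAC", "IgM", children=leaves + [Node("AAT", "IgG1", 1)])
collapse(parent)

child_labels = {(c.sequence, c.isotype) for c in parent.children}
```

After collapsing, the IgM copy merges into the parent, while the IgG1 copy with the same sequence stays a separate child. A sequence-only collapse would have merged both copies.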

For the example above, we would consider both the tree on the right and the following one, which benefits from a partial resolution of the multifurcation using isotypes. If isotypes weren't provided, we would only consider the tree on the left in the above image for ranking:

This tree seems more plausible, because it places the high-abundance node above more children in the [isotype, sequence]-collapsed tree, and because it has only one isotype switching event.

Here's a more detailed example (with a different starting MP tree) showing the steps in the list above. Edges without mutations are marked with a slash, and inferred ancestral isotypes are enclosed in parentheses.

Implementation Details:

At some point (#91), I spent a while making gctree work with ambiguous input sequences. This was quite difficult, because different MP trees have different resolved tip sequences, resulting in different collapsed abundances between trees. Making it work required a somewhat complicated scheme that stores abundances as part of hDAG node labels, along with involved pre-processing of MP trees from dnapars.

To make these proposed isotyping changes easier, I'd like to revert those changes, so that gctree only works with fully resolved sequences again. Anyone who needs ambiguous sequences could still use one of the versions that supports them, but this doesn't seem to be a feature that core gctree users need, and dropping it would make the code much cleaner and easier to modify in the future, including for the changes proposed in this issue.

Whether inference is provided with isotype data will affect at least the following:

How MP trees are pre-processed
hDAG node label data
the extraction of CollapsedTrees from the hDAG CollapsedForest object
rendering of CollapsedTrees (renderings should indicate isotype of nodes)
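As a rough illustration of the node label point, the sketch below assumes labels are hashable named tuples. The `SequenceLabel` and `IsotypedLabel` classes are hypothetical and are not gctree or historydag types. Adding an isotype field keeps nodes distinct that would otherwise share a label.

```python
from typing import NamedTuple, Optional


class SequenceLabel(NamedTuple):
    """Hypothetical label used when no isotype data is provided."""

    sequence: str


class IsotypedLabel(NamedTuple):
    """Hypothetical label used when isotype data is provided."""

    sequence: str
    isotype: Optional[str] = None


# Two nodes with the same sequence but different isotypes.
with_isotype = {IsotypedLabel("AAC", "IgM"), IsotypedLabel("AAC", "IgG1")}

# Dropping the isotype field makes the two labels identical.
without_isotype = {SequenceLabel(label.sequence) for label in with_isotype}
```

Under this assumption, the set with isotypes holds two labels and the set without holds one, so anything keyed on node labels changes depending on whether isotype data is provided.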

matsengrp / gctree

Rethink isotype inference #116
