matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0
16 stars 2 forks source link

Add filtering by reversions to naive sequence, generalize ranking strategy #130

Closed willdumm closed 1 month ago

willdumm commented 2 months ago

Motivation: Minimizing reversions

We continue to receive feedback that users of gctree often see reversions in the inferred trees, and that they edit the trees to remove these. Unfortunately, counting the total number of reversions in each tree efficiently is difficult because of the data structure we use for storing trees. However, it is easy to count only reversions to the naive sequence.

The primary purpose of this PR is to add the option to rank trees according to the number of reversions to the naive base at any site. However, experimenting with real gcreplay data, I see that prioritizing reversions above context likelihood or branching process likelihood often results in a significant reduction in likelihood of the top-ranked tree. Users of this option should consider how their intuition about reversions relates to the intuition behind likelihood models already available.

Also keep in mind that reversions are an inevitable feature of parsimony trees, when data they’re constructed on contains sites that support conflicting partitions of the taxa. A tree without reversions is possible in this case, but it will not be maximally parsimonious (it will contain many duplicate mutations occurring independently on different branches). Trees constructed using non-parsimony methods on densely sampled data will almost certainly also contain reversions, because it’s considered unlikely for the same mutation to occur independently many times. Since GCtree only ranks maximally parsimonious trees, it cannot completely eliminate reversions.

Additional changes in this PR

Adding reversion counting as a ranking criterion brings us to a total of six available ranking criteria. Previous versions of gctree allowed ranking of trees in two ways: either lexicographically with a fixed, default order of criteria (BP likelihood, isotype parsimony, context likelihood, allele count), or using a linear combination of available criteria, specified using the options --branching_process_ranking_coeff, --ranking_coeffs, and --use_old_mut_parsimony. This system of options was a big mistake, because it’s hard to add additional criteria while maintaining backward compatibility (the number of ranking coefficients needs to change) and despite its complexity it doesn’t allow customization of lexicographic orderings.

For this reason, I think it would be best to deprecate the three CLI options mentioned above, and replace them with a --ranking_strategy option which takes a string expression. Here’s how this is described in the docs:

Expression describing tree ranking strategy. If provided, takes precedence over all other ranking arguments. Two types of expressions are permitted: First are those describing lexicographic orderings, like B,C,A, which means choose trees to maximize branching process likelihood, then maximize context likelihood, then minimize number of alleles. Next are expressions describing linear combinations of criteria, like B+2C-1.1A, which means choose trees to minimize the specified linear combination of criteria. If linear combination expression has leading -, use = instead of space to separate argument, e.g. --ranking_strategy=-B+R. These two methods of ranking cannot be combined. For example, B+C,A is not a valid ranking strategy expression. Ranking criteria are specified using the following identifiers: B - branching process likelihood (default maximized), I - isotype parsimony (default minimized), C - context-based Poisson likelihood (default maximized), M - old mutability parsimony (deprecated, default minimized), A - number of alleles (default minimized), R - sitewise reversions to naive sequence (default minimized) To compute the value of a criterion on ranked trees without affecting the ranking, include that ranking criterion with a coefficient of zero, as in B+2C+0A, or B,C,0A. To reverse the default ordering on a criterion in lexicographic ranking, provide a negative coefficient. For example, B,-A will first maximize branching process likelihood, then maximize the number of alleles, since the default is to minimize the number of alleles. -B, A will minimize branching process likelihood, since the default is to maximize it, then minimize alleles. A ranking strategy string containing a single ranking criterion identifier will be treated as a lexicographic ordering. That is, B will maximize branching process likelihood, and -B will minimize branching process likelihood, while B+0C will minimize branching process likelihood and -B+0C will maximize it. gctree infer --verbose will describe the ranking strategy used. Examine this output to make sure it’s as expected.

It is difficult to maintain consistent behavior and backward compatibility with the old CLI options, so their use now throws an error, urging the user to employ the new --ranking_strategy option. However, the default behavior without any of these CLI options remains identical to older versions of gctree.

If we ever want to add additional ranking criteria to gctree, these changes will make it very straightforward to do so while maintaining backward-compatibility.

Other changes:

Remaining questions / TODOs:

willdumm commented 2 months ago

Would close #90

willdumm commented 1 month ago

Thanks for the point about log likelihoods, I'll change that.

Is your proposal to change criterion definitions your recommendation? It seems reasonable.

I did look at how reversion counting would change inference on the gcreplay data. Keep in mind that this is reversion to the naive sequence, so it doesn't necessarily imply that there are fewer reverted mutations in the tree.

I compared these lexicographic orderings:

Over 145 alignments, prioritizing naive reversions over context likelihood, but less than branching process likelihood results in a reversion reduction in 6.21 percent of gcreplay alignments. The median improvement was 1. The max improvement was 4. The median log context likelihood reduction was 0.77. The max log context likelihood reduction was 8.54.

Prioritizing naive reversions over everything else, including branching process likelihood results in a reversion reduction in 42.07 percent of gcreplay alignments. The median improvement was 1. The max improvement was 10. The median log bp likelihood reduction was 0.61. The max log bp likelihood reduction was 2.77.

willdumm commented 1 month ago

Further changes and additions:

I think this is ready to merge @WSDeWitt. Since it contains changes that aren't backward-compatible I suppose it should have a new major version.

willdumm commented 1 month ago

I've attempted to re-create the behavior of the old ranking coefficient arguments by translating them into a ranking strategy expression, if provided. I brought back some old tests to make sure that this works, so we should now be able to release this as something like v4.3.0