Add filtering by reversions to naive sequence, generalize ranking strategy

willdumm commented 2 months ago

Motivation: Minimizing reversions

We continue to receive feedback that users of gctree often see reversions in the inferred trees, and that they edit the trees to remove these. Unfortunately, counting the total number of reversions in each tree efficiently is difficult because of the data structure we use for storing trees. However, it is easy to count only reversions to the naive sequence.

The primary purpose of this PR is to add the option to rank trees according to the number of reversions to the naive base at any site. However, experimenting with real gcreplay data, I see that prioritizing reversions above context likelihood or branching process likelihood often results in a significant reduction in likelihood of the top-ranked tree. Users of this option should consider how their intuition about reversions relates to the intuition behind likelihood models already available.

Also keep in mind that reversions are an inevitable feature of parsimony trees, when data they’re constructed on contains sites that support conflicting partitions of the taxa. A tree without reversions is possible in this case, but it will not be maximally parsimonious (it will contain many duplicate mutations occurring independently on different branches). Trees constructed using non-parsimony methods on densely sampled data will almost certainly also contain reversions, because it’s considered unlikely for the same mutation to occur independently many times. Since GCtree only ranks maximally parsimonious trees, it cannot completely eliminate reversions.

Additional changes in this PR

Adding reversion counting as a ranking criterion brings us to a total of six available ranking criteria. Previous versions of gctree allowed ranking of trees in two ways: either lexicographically with a fixed, default order of criteria (BP likelihood, isotype parsimony, context likelihood, allele count), or using a linear combination of available criteria, specified using the options --branching_process_ranking_coeff, --ranking_coeffs, and --use_old_mut_parsimony. This system of options was a big mistake, because it’s hard to add additional criteria while maintaining backward compatibility (the number of ranking coefficients needs to change) and despite its complexity it doesn’t allow customization of lexicographic orderings.

For this reason, I think it would be best to deprecate the three CLI options mentioned above, and replace them with a --ranking_strategy option which takes a string expression. Here’s how this is described in the docs:

Expression describing tree ranking strategy. If provided, takes precedence over all other ranking arguments. Two types of expressions are permitted: First are those describing lexicographic orderings, like B,C,A, which means choose trees to maximize branching process likelihood, then maximize context likelihood, then minimize number of alleles. Next are expressions describing linear combinations of criteria, like B+2C-1.1A, which means choose trees to minimize the specified linear combination of criteria. If linear combination expression has leading -, use = instead of space to separate argument, e.g. --ranking_strategy=-B+R. These two methods of ranking cannot be combined. For example, B+C,A is not a valid ranking strategy expression. Ranking criteria are specified using the following identifiers: B - branching process likelihood (default maximized), I - isotype parsimony (default minimized), C - context-based Poisson likelihood (default maximized), M - old mutability parsimony (deprecated, default minimized), A - number of alleles (default minimized), R - sitewise reversions to naive sequence (default minimized) To compute the value of a criterion on ranked trees without affecting the ranking, include that ranking criterion with a coefficient of zero, as in B+2C+0A, or B,C,0A. To reverse the default ordering on a criterion in lexicographic ranking, provide a negative coefficient. For example, B,-A will first maximize branching process likelihood, then maximize the number of alleles, since the default is to minimize the number of alleles. -B, A will minimize branching process likelihood, since the default is to maximize it, then minimize alleles. A ranking strategy string containing a single ranking criterion identifier will be treated as a lexicographic ordering. That is, B will maximize branching process likelihood, and -B will minimize branching process likelihood, while B+0C will minimize branching process likelihood and -B+0C will maximize it. gctree infer --verbose will describe the ranking strategy used. Examine this output to make sure it’s as expected.

It is difficult to maintain consistent behavior and backward compatibility with the old CLI options, so their use now throws an error, urging the user to employ the new --ranking_strategy option. However, the default behavior without any of these CLI options remains identical to older versions of gctree.

If we ever want to add additional ranking criteria to gctree, these changes will make it very straightforward to do so while maintaining backward-compatibility.

Other changes:

use Pandas DataFrame printing utilities to format gctree output more consistently
change names of ranking criteria: Mut. Pars. to MutabilityParsimony and Isotype Pars. to IsotypeParsimony for consistency and to aid in manual parsing of gctree log files
add tests to tests/smalltest.sh for various kinds of --ranking_strategy expressions.
as mentioned in the quoted docstring above, we can compute criteria without ranking with respect to them by giving them a coefficient of zero. These criteria are included in outputs from --tree_stats (a csv file and pair plot showing all trees and best ranked tree) and --summarize_forest.

Remaining questions / TODOs:

I think it's nice for a ranking strategy of e.g. B,C,R to automatically maximize branching process likelihood and context likelihood, but minimize reversions. However, as noted in the docstring for the --ranking_strategy option, this creates ambiguity about the behavior of the ranking strategy B and -B, since linear combinations of criteria are always minimized. However, gctree output includes a description of how ranking is done, so that should help. For example: Ranking trees to maximize LogBPLikelihood then minimize IsotypeParsimony then maximize LogContextLikelihood then minimize Alleles then minimize NaiveReversions or Ranking trees to minimize a linear combination of -1(LogBPLikelihood) + 0.01(IsotypeParsimony) + -1(LogContextLikelihood). There is also a warning any time a coefficient's sign contradicts the default ordering. The alternative obvious approach would be to expect the user to sprinkle in negative signs whenever they want a criterion maximized rather than minimized, for instance -B,-C,R to do what B,C,R does now. That would eliminate any ambiguity/inconsistency, but it's less intuitive. Thoughts @WSDeWitt?
I'll need to update the Quickstart to describe these changes

willdumm commented 2 months ago

Would close #90

willdumm commented 1 month ago

Thanks for the point about log likelihoods, I'll change that.

Is your proposal to change criterion definitions your recommendation? It seems reasonable.

I did look at how reversion counting would change inference on the gcreplay data. Keep in mind that this is reversion to the naive sequence, so it doesn't necessarily imply that there are fewer reverted mutations in the tree.

I compared these lexicographic orderings:

BP likelihood, context likelihood, alleles, reversion count (essentially the currently used method, since context likelihood almost always picks out a single tree)
BP likelihood, reversions, context likelihood, alleles
reversions, BP likelihood, context likelihood, alleles

Over 145 alignments, prioritizing naive reversions over context likelihood, but less than branching process likelihood results in a reversion reduction in 6.21 percent of gcreplay alignments. The median improvement was 1. The max improvement was 4. The median log context likelihood reduction was 0.77. The max log context likelihood reduction was 8.54.

Prioritizing naive reversions over everything else, including branching process likelihood results in a reversion reduction in 42.07 percent of gcreplay alignments. The median improvement was 1. The max improvement was 10. The median log bp likelihood reduction was 0.61. The max log bp likelihood reduction was 2.77.

willdumm commented 1 month ago

Further changes and additions:

Will's suggestion about minimizing all criteria by default is now implemented, and instead of log likelihoods we call them log loss.
The Quickstart is updated to reflect changes made in this PR
An option to draw nucleotide mutations on tree branches --show_nucleotide_mutations is added. This is nice for looking at reversions, since the existing aa mutation annotations don't show silent mutations
When ranking by linear combination, the value of the ranking expression is shown on the pair plot produced by --tree_stats

I think this is ready to merge @WSDeWitt. Since it contains changes that aren't backward-compatible I suppose it should have a new major version.

willdumm commented 1 month ago

I've attempted to re-create the behavior of the old ranking coefficient arguments by translating them into a ranking strategy expression, if provided. I brought back some old tests to make sure that this works, so we should now be able to release this as something like v4.3.0

matsengrp / gctree