psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction

Fix super-long insertions/deletions in large, high-mutation families #308

Closed psathyrella closed 3 years ago

psathyrella commented 3 years ago

A longstanding issue with partis is that in large, high-shm families, the multi-hmm infers spuriously long insertions and deletions, for example (top: true, bottom: inferred):

[image: example true vs. inferred annotation]

This is simply a consequence of the star tree approximation: a star tree means interpreting each point mutation in each sequence as a separate mutation event, so in the presence of large amounts of shared mutation among sequences, the hmm tries to reduce this (incorrectly) large number of independent mutations by expanding insertions and deletions.

Unfortunately the star tree assumption is tightly linked to having any hope of being fast enough to run on large repertoires: it means that calculating emission probabilities on a multi-sequence annotation is just multiplying independent probabilities. When originally implementing the hmm in 2015 I tried a number of heuristics for moderating the star tree assumption (e.g. "these sequences share a lot of mutations, so they're probably on nearby branches, so adjust their joint emission probs accordingly"), but eventually decided that anything useful would have to amount to doing actual phylogenetic inference in place of just multiplying independent probabilities. Luckily, Erick and Amrit did just this in linearham, which is a fantastic tool for getting a much more accurate picture of individual lineages of interest.
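To make that trade-off concrete, here is a minimal sketch (not partis's actual code; the function name and probability values are invented for illustration) of what the star tree buys, and what it costs:

```python
# Minimal sketch of the star tree shortcut: the joint emission probability of a
# multi-sequence annotation at one position is just the product of each sequence's
# independent emission probability, which is why it's fast -- and why shared
# mutations get over-penalized as if they were separate events.

def star_tree_emission_prob(emission_probs, observed_bases):
    """emission_probs: base -> P(observe base | germline state, mutation rate)
    observed_bases: the base each sequence in the cluster shows at this position."""
    prob = 1.0
    for base in observed_bases:
        prob *= emission_probs[base]  # each sequence treated as its own branch off the naive root
    return prob

# Three sequences that all share the same mutation away from a germline 'A':
probs = {'A': 0.85, 'C': 0.05, 'G': 0.05, 'T': 0.05}
print(star_tree_emission_prob(probs, ['G', 'G', 'G']))  # 0.05**3: one shared mutation
# is charged as three independent ones, which is what pushes the hmm toward explaining
# the observed bases with longer insertions/deletions instead.
```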

However that left open the problem of finding a more accurate approximation for partis that could run on whole repertoires. I implemented a potential fix a while ago that looked for multi-sequence annotations whose corresponding single-sequence smith-waterman annotations had much shorter deletions, but ran out of steam before doing enough tweaking and validation to feel like it was a real improvement. The main issue was that the star tree's over-enthusiastic expansion of insertions/deletions turned out to be a pretty good optimization to improve naive sequence inference accuracy: non-templated regions (which you get more of if your insertions/deletions are too long) basically just take the consensus of observed sequences, so tend to be great at inferring naive sequences, despite being bad at inferring insertion/deletion length and N SHMs.

So, anyway, I finally got around to working on this again, and came up with two changes that fix the issue. You'll still get more accurate naive sequences with linearham, but this means that the stupid over-long insertions that partis would sometimes spit out on large high-mutation families shouldn't happen any more.

The first change is simple, and probably should have been implemented to begin with, but I didn't realize it was necessary: when building the hmm yaml model files, just forbid deletions that erode the conserved codons. So don't let V 3' deletions extend into the conserved cysteine, or J 5' deletions extend into the tryp/phen. We probably occasionally observe real non-productive sequences that have deletions longer than this, but we generally don't care about them (and if we do, there's an option to allow the longer deletions: --allow-conserved-codon-deletion).
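As a rough illustration of the first change (the real change lives in the code that builds the hmm yaml model files; the function and argument names below are hypothetical), the cap on allowed deletion length amounts to something like this:

```python
# Hypothetical sketch of capping deletion lengths so they can't erode the conserved codons.

def max_allowed_deletion(gene_length, codon_start, region, allow_codon_deletion=False):
    """Longest deletion that leaves the conserved codon intact.

    gene_length: length of the germline gene
    codon_start: 0-based position of the first base of the conserved cysteine (V)
                 or tryptophan/phenylalanine (J)
    region: 'v' for the V 3' deletion, 'j' for the J 5' deletion
    """
    if allow_codon_deletion:  # i.e. behave like --allow-conserved-codon-deletion
        return gene_length
    if region == 'v':  # 3' deletion eats from the right end; stop before the codon's last base
        return gene_length - (codon_start + 3)
    elif region == 'j':  # 5' deletion eats from the left end; stop before the codon's first base
        return codon_start
    else:
        raise ValueError('unexpected region %s' % region)
```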

The second change is to finalize the single-sequence smith-waterman hack above: for each boundary (vd and dj), if the multi-seq annotation's 5' deletion is more than one std dev away from the mean of the single-seq smith-waterman 5' deletions, we reduce the multi-seq annotation's deletion/insertion lengths until they're one std dev from that mean. This could also be turned off (EDIT: it could be, e.g. here, but the correction no longer exists, since subcluster annotation, see below, is better). Using only 5' (rather than 3' or both) is arbitrary, but seems to work just fine and makes implementation much easier.
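A hedged sketch of that correction (not the actual partis implementation; the function name and example numbers are made up), for a single boundary's 5' deletion:

```python
import statistics

def corrected_deletion_length(multi_seq_del, single_seq_sw_dels, n_std=1.0):
    """If the multi-seq annotation's 5' deletion is more than <n_std> std devs above
    the mean of the single-seq smith-waterman 5' deletions, shrink it to that bound
    (in the real correction the corresponding insertion shrinks along with it)."""
    mean = statistics.mean(single_seq_sw_dels)
    std = statistics.pstdev(single_seq_sw_dels)
    max_allowed = mean + n_std * std
    if multi_seq_del <= max_allowed:
        return multi_seq_del  # already consistent with the single-seq sw annotations
    return int(round(max_allowed))

# e.g. the multi-hmm wants a 17-base deletion, but single-seq sw deletions cluster around 2:
print(corrected_deletion_length(17, [1, 2, 3, 2, 2, 3, 1, 3]))  # -> 3
```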

Most of the validation for these changes consisted of exaggerating the effect: simulating weird trees (using bin/bcr-phylo-run.py just to get the .nwk), with high mutation (simulating with plain partis, but with the bcr-phylo .nwk as input tree and --mutation-multiplier), and with --mutate-conserved-codons turned on.

An example with 25-sequence families is shown below, first for boundary-length performance, then mutation performance (the x-axis is poorly labelled here: it's the inferred - true value, except for hamming distance):

[images: boundary-length and mutation performance, 25-sequence families]

There are four versions shown, one for each combination of the two changes turned off or on:

- erode_XXX: allow conserved codon erosion (old default)
- preserve_XXX: preserve conserved codons (i.e. forbid their deletion)
- XXX_boundaries: old (uncorrected) deletion boundaries
- XXX_1.0: apply the new sw boundary correction with 1.0 std devs

The basic picture is that either change basically fixes the problem by itself. The sw boundary correction is more effective (which makes sense -- it goes further than just forbidding conserved codon deletion). The skinny red line (applying both changes) is essentially always best, although fat red (only the sw boundary correction) is usually pretty close. That is with very high shm -- averaging 20% nucleotide.

Here are similar plots for 10-sequence clusters with more typical shm (average 3-5% nucleotide; results are pretty similar):

[images: boundary-length and mutation performance, 10-sequence families]

Whereas here are results with high mutation but small (3-sequence) clusters (the corrections don't make much difference, since the star tree is fine for 3-sequence trees):

[images: boundary-length and mutation performance, 3-sequence families]

psathyrella commented 3 years ago

attaching run script

EDIT: removing this, since I ended up committing it (and then removing it); see https://github.com/psathyrella/partis/commit/cad647b59c16df951eb572d016bf51a504cec906

psathyrella commented 3 years ago

Oh wait, never mind, I have a much better way to do this. Since everything's a star tree if you limit yourself to a small enough bit of the tree, I can divide large families/clusters into (much) smaller subclusters, each small enough that it's star-like. Then I rerun on the inferred naive sequences from those subclusters, iterating as necessary until there are few enough inferred naive sequences that they form only one subcluster.
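A very rough sketch of that iteration (not the partis implementation; annotate_fcn stands in for the multi-hmm annotation step and is assumed to return a dict with an inferred naive sequence; the subcluster size corresponds to --subcluster-annotation-size below):

```python
def subcluster_annotate(seqs, annotate_fcn, subcluster_size=3):
    """Split <seqs> into star-like subclusters, annotate each, then re-annotate the
    inferred naive sequences, iterating until only one subcluster remains."""
    while len(seqs) > subcluster_size:
        subclusters = [seqs[i:i + subcluster_size] for i in range(0, len(seqs), subcluster_size)]
        # replace each subcluster with its inferred naive sequence, then go around again
        seqs = [annotate_fcn(sc)['naive_seq'] for sc in subclusters]
    return annotate_fcn(seqs)  # final annotation on the remaining (few) sequences
```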

EDIT (2020-06-10) just wanted to add that this approach ("subcluster annotation") indeed works well and has been the default for a while. The size of the subclusters is controlled with --subcluster-annotation-size, which defaults to 3. I did a lot of validation (similar to the plots above) to work out the best value for this. The original validation stuff is in /fh/fast/matsen_e/dralph/partis/fix-super-long-insertions/subcluster-annotation-*, but I'll of course redo it to go in a paper.

I also just made subcluster annotation work with alternative annotations: just add the initial round of subclusters (of size about --subcluster-annotation-size) to the full-final cluster annotation, and pass them into the alternative annotation processing function. This is probably a lot better than the old method of passing in the annotations for ~all clusters that were calculated during clustering, partly because it's usually way fewer annotations, so way faster, but mostly because there's no double-counting (with the new way all seqs appear exactly twice), so the uncertainty estimates should be a lot better. I've added some details to the subcommands section of the manual.

psathyrella commented 3 years ago

https://github.com/psathyrella/partis/commit/8c7c9e06d37a049d5f446edd1f60974b22eb941c