psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
54 stars 36 forks source link

germline inference from bad sequences #281

Closed scharch closed 5 years ago

scharch commented 5 years ago

I know partis does germline inference from clonally-collapsed sequences, which I guess means that in-dels get reverted before mutations are counted. Still, I'm wondering what the limit might be on the ability of that to compensate for rotten data (in this case, an old 454 run where 90% of the sequences have frameshift errors). partis correctly marks these as invalid, but it's not clear to me if they are still used in the inference or how badly that might throw things off.

psathyrella commented 5 years ago

Yes, germline inference happens after shm indel reversal.

I think it could compensate ok, and the extent to which it does should just depend on how effective the shm-indel finding is.

So if all the sequencing errors are indels, and the shm indel reversal (in smith waterman) finds all of them and correctly reverses them, then as far as germline inference is concerned there aren't any sequencing errors.

If, on the other hand, we're not perfectly finding indels, this would I think take the form of spurious point mutations, which, I think, if they're distributed throughout the V, shouldn't result in spurious germline inference. To get spurious germline inference, I think you generally need some kind of sequencing error that is biased by position -- for instance if the seventh base is always error'd to T, well of course that looks identical to a new allele. I think missed shm indels would result in strong correlations between mutations in adjacent positions, but I don't think that will screw up the germline inference unless there's also a strong preference for particular positions.

So then we'd need to measure how well the shm indel finding is performing, so we can see how close to either of those extremes we are. (Although, ok, I did just claim that even if we're doing badly it shouldn't screw up the germline inference much, but clearly it's better if we're in the first regime. If nothing else, adding more mutations, whether real or not, just reduces your sample size, and makes everything noisier).

I think shm indel finding will fail in basically three cases

  1. nearby compensating indels, i.e. a 2-base insertion next to a 2-base deletion. This should be really rare, i'd think, although that of course depends on how much we're talking about. If lots of them are 1-base, and there's 10 per sequence, plenty will be close enough to compensate each other and be missed.
  2. indels that make the sequence align more closely to a different v gene, for instance if v1 and v2 differ mainly by a 3 base insertion at position 102, then any 3 base insertion near there will cause us to get the wrong gene. I don't have any intuition for how common this is, but it would be straightforward to model as long as you have some numbers for how frequent and how long the indels are expected to be.
  3. indels near the ends

This depends a ton on the real shm rate in the sequences. If this is a sample with 2% shm, you'll probably find almost all the indels. If it's 30% you'll find way, way fewer.

One thing that might help get a handle on this is varying the --gap-open-penalty that smith waterman uses to look for shm indels. The default value (30) was optimized for typical samples, and i have no idea if it's optimal for your case, but that aside, scanning it from like 1 to 150 and seeing what happens should give some insight into what's going on.

Also, if you have some cleverer way of finding the indels, you could run that beforehand and reverse the indels yourself before passing to partis.

scharch commented 5 years ago

Unfortunately, compensating in-dels are not that uncommon in 454 data. They're usually pretty easy to spot by eye (at least down to ~5 nt separation), but frequently hard to detect on an automated basis (clustalW can catch them at a reasonable rate, clustalO and muscle cannot). I've seen shifts over a span of 25-30 bases that will pass most checks if they manage to avoid creating a stop codon. The cause for concern here is that about 40% of the valid sequences are still marked as having shm_indel. In addition, despite being a low mutation IgM data set, partis finds relatively few that are completely germline:

gene input total unmutated
IGHV2-70*01 raw 390 0
IGHV2-70*11 filtered 262 254
IGHV1-69*12+T162C.G219A raw 716 0
IGHV1-69*12 filtered 600 586

This is very much an edge case, so I'm not terribly worried about it. If I have time, I will try playing around with the gap-open-penalty.

psathyrella commented 5 years ago

but frequently hard to detect on an automated basis (clustalW can catch them at a reasonable rate, clustalO and muscle cannot)

that sounds like trying the suggestion ^ to reverse the indels before passing to partis might be a good way to go, if you've got a good way to find them that you've optimized.

And yeah, if this is a low-shm sample, then cranking the gap open penalty way down may be all you need. The default value was optimized to work ok on all samples, but for optimality higher shm samples would need a higher gap open to avoid opening spurious indels. Indeed this is how the sw match/mismatch penalty is handled: the sequences are split apart into categories based on shm rate, and run with different match/mismatch scores (this is the source of the more-than-you-asked-for number of subprocesses you noticed a while ago). Typically shm indels are rare enough that it didn't seem worthwhile to do a similar optimization for gap open penatly.

psathyrella commented 5 years ago

@williamdlees found a nice example of germline inference blowing up (as expected) when there's lots of sequencing error that's correlated with position (in #283). The issue seems to be that in the Rubelt (heritable influence) paper the paired reads don't really overlap. You can see this by plotting where the Ns tend to occur -- they're pretty much all found between positions 170 and 210 in V. The crazy number of novel alleles (46) are also almost entirely based on positions that are also in this window.

scharch commented 5 years ago

I mean, 454 homopolymer indel shouldn't be correlated to position except insofar as where the homopolymer tracts are in each V gene, but maybe that's enough. I willing to conclude that 454 sequencing data would need to be filtered before doing inference with partis and leave it at that.