zero-bin to one-bin discontinuity in mutation frequencies

psathyrella commented 7 years ago

@lauranoges I forget if I've asked this before, but is there a biological or experimental reason why the kate-qrs and laura-mb samples would have plenty of sequences with one or more mutations in V, but be depleted in those with zero mutations? I'm used to this distribution skootching to the right, or to the left, or being sharper, or flatter, for different samples, but it's usually pretty smooth. Whereas on these samples it looks like someone swiped half the sequences from the zero bin.

kate-qrs on left, laura-mb on right:

laura says: """ Hmm. With the kate QRS samples, we only sequenced IgG (memory) which are by definition usually SHM’ed - and they’re HIV-infected - so maybe that’s why there are no zero-mutated seqs? With my MB samples, I have the IgMs in there, so there should definitely be germline sequences. This is troubling, especially since you’ve already examined their germline sequences and they aren’t just really uncommon alleles, right?

"""

I would definitely expect hiv infection to shift the distribution to the right, which would decrease the fraction that're in the zero bin. It's the discontinuity between the 0- and 1-bin that seems weird, though, not the fraction in the 0-bin.
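One rough way to put a number on "discontinuity", as a sketch only: compare ratios of adjacent bin counts. For a smooth distribution the 0-bin/1-bin ratio should sit fairly close to the 1-bin/2-bin ratio, whereas a depleted zero bin makes the first ratio much smaller. The function name and the input (a plain list of per-sequence V mutation counts) are hypothetical and not taken from partis.

```python
from collections import Counter


def adjacent_bin_ratios(n_mutations, max_bin=3):
    """Return [count(k) / count(k+1) for k in 0..max_bin-1].

    n_mutations: hypothetical input, one integer per sequence giving the
    number of V mutations. An entry is None if the (k+1)-bin is empty.
    """
    hist = Counter(n_mutations)
    ratios = []
    for k in range(max_bin):
        if hist[k + 1]:
            ratios.append(hist[k] / hist[k + 1])
        else:
            ratios.append(None)
    return ratios


# Toy data (made up): zero bin looks depleted relative to its neighbors.
toy = [0] * 5 + [1] * 20 + [2] * 10
print(adjacent_bin_ratios(toy, max_bin=2))
```

A first ratio far below the second is the pattern described here; a sample that is simply shifted to the right would instead show both ratios changing gradually.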

The isotype sorting seems more promising, though -- we're mostly sequencing plasma cells, right? And maybe they have to go through at least one cycle of shm before they're allowed to class-switch? But that doesn't really fit with the analogous plots for, say, the cdr3, which don't have the discontinuity (i.e. plenty of unmutated sequences, despite having the overall higher mutation rate you'd expect; middle row left):

I agree, my initial thought was it could be a germline problem -- for instance, it's hard to get the bases to the 3' side of the cysteine right when you're inferring new alleles, and a single position that was wrong would show up as a super-prevalent, and spurious, mutation. But all the per-gene per-position mutation plots look normal (no position above 0.5 or so). For instance:

(more of them here: /fh/fast/matsen_e/dralph/partis/tmp/kate-qrs-3k/plots/sw/mute-freqs/per-gene-per-position/v.html). Also, this would have to happen in most or all of the V genes to cause the 0-bin to 1-bin discontinuity. Whereas germline inference stuff is more about the periphery -- imgt has most of the alleles in anybody, the germline inference is just about fixing the few percent that're wrong, and the discontinuity is a bigger effect.
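The per-position check mentioned above can be sketched roughly like this: for one germline gene, compute the fraction of aligned sequences that differ from germline at each position, and flag anything above a threshold. This is a minimal illustration assuming sequences are already aligned to the germline and the same length; the function names and the 0.5 default are hypothetical, not how partis makes its plots.

```python
def per_position_mutation_freqs(germline, seqs):
    """Fraction of sequences differing from germline at each position.

    Assumes every sequence in seqs is aligned to germline (same length).
    Non-ACGT characters (e.g. N) are skipped rather than counted.
    """
    mutated = [0] * len(germline)
    totals = [0] * len(germline)
    for seq in seqs:
        for i, (g, s) in enumerate(zip(germline, seq)):
            if s in "ACGT":
                totals[i] += 1
                if s != g:
                    mutated[i] += 1
    return [m / t if t else 0.0 for m, t in zip(mutated, totals)]


def suspicious_positions(freqs, threshold=0.5):
    """Positions mutated in more than `threshold` of sequences."""
    return [i for i, f in enumerate(freqs) if f > threshold]


# Toy data (made up): the last position is mutated in 3 of 4 sequences.
freqs = per_position_mutation_freqs("ACGT", ["ACGA", "ACGA", "ACGT", "TCGA"])
print(suspicious_positions(freqs))
```

A single wrong germline base would show up as one position flagged in nearly every sequence for that gene, which is what the plots above don't show.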

For comparison, I'm more used to seeing something like this

where I haven't done a good job of choosing particularly comparable samples -- they have lower mutation rates overall, and the bottom three don't have the full V read, but still, there's got to be some reason for the difference.

To be clear, I'm not necessarily super worried about this, but there seems to be something between the repertoire and my plots filtering out sequences with unmutated V segments (but not cdr3s and Js), and it seems prudent to figure out if it's biological, or part of sequencing, or somewhere in my workflow (I mean I think it isn't the latter, but, you know...).

lauradoepker commented 7 years ago

Do you see this in the Laura mb samples too? The fact that these are Kenyan HIV-infected individuals may be the key. Maybe IMGT really is skewed away from Kenyan alleles. But you're right about it being suspicious that it's every allele.

IgM sets should be helpful to analyze in Laura mb.

I wonder if it could potentially be how the VDJ region is trimmed for analysis? If there's an extra base at the beginning or end maybe that's the answer?

I'll think some more too...

psathyrella commented 7 years ago

laura-mb is in the top right plots. It's hard to tell: since the mutation rate overall is higher, the zero-bin looks more plausibly low as part of the overall distribution shape.

I think there's no question that imgt is missing lots of kenyan alleles, but that should get picked up at least reasonably well by the new allele inference, if nothing else.

Kristian says that, indeed, it should be very rare for class switching to be allowed without at least one round of shm. And the only inconsistency with that explanation was that the discontinuity doesn't show up in the cdr3 or J. But I think that could just be that we don't measure mutation frequency as accurately within the cdr3 (for the last base or two of V and the first base or two of J, you never really have any idea if it's an unmutated insertion or a mutated germline base), i.e. I suspect the discontinuity is masked in the cdr3 plot by sequences in the 1-bin getting shuffled into the zero-bin. Then there's maybe no discontinuity in the J because it's relatively short and tends not to mutate, i.e. the single mutation in those sequences is prolly in the cdr3 or v.
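A toy illustration of the masking idea, under made-up assumptions: if mutations at ambiguous boundary positions can't be told apart from insertions, they don't get counted, so some sequences that truly have one mutation land in the observed zero bin. The position indices and function name here are hypothetical.

```python
from collections import Counter


def observed_mutation_counts(true_positions_per_seq, ambiguous_positions):
    """Count only mutations that fall outside the ambiguous positions.

    true_positions_per_seq: hypothetical input, one list of mutated
    position indices per sequence.
    ambiguous_positions: indices where a mutation is indistinguishable
    from a non-templated insertion, so it goes uncounted.
    """
    ambiguous = set(ambiguous_positions)
    return [
        sum(1 for p in positions if p not in ambiguous)
        for positions in true_positions_per_seq
    ]


# Toy data (made up): no sequence is truly unmutated, but two of the
# single mutations sit at ambiguous position 0.
true_positions = [[0], [0], [5], [5, 6]]
true_hist = Counter(len(p) for p in true_positions)
obs_hist = Counter(observed_mutation_counts(true_positions, [0]))
print(true_hist[0], obs_hist[0])
```

In this toy case the true zero bin is empty while the observed zero bin is not, which is the kind of shuffling that could hide a discontinuity in the cdr3 plot.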

I'm pretty happy with that explanation, anyway. And I'm glad to know about the effect -- it's important to know that smooth mutation freq distributions are a shitty assumption even if you're doing something as simple as sorting by isotype.

matsengrp / cft

zero-bin to one-bin discontinuity in mutation frequencies #153