psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Choosing positions to test for allele finding #285

Closed scharch closed 5 years ago

scharch commented 5 years ago

How does partis do this? I have a dataset (actually the same one you used for the allele finding manual) that I originally processed using commit 00421b4. Among the inferred alleles from that run were IGHV2-70\*11+G1A and IGHV4-31\*03+A66G. Looking at the attached plots from that run, the evidence looks pretty strong. But when I redo the inference with the exact same input data using a91c3b6, it doesn't even consider those positions! What's going on?

[attached plots: IGHV2-70s11p1, IGHV4-31s03p66]

psathyrella commented 5 years ago

There's a very long chain of logic to determine what positions in what genes make it as far as the final stage, i.e. the full fitting procedure. This chain is described in a fair amount of detail in the "Approximations and prefilters" section of the paper. Briefly, doing the fits for every position in every gene would require hundreds of thousands of fits, which is computationally prohibitive, so, for instance, we don't consider positions that don't have enough sequences, or don't have enough sequences in the relevant bins, or don't have appropriate slopes in a number of approximate-fitting procedures.
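As a rough illustration of the kind of prefiltering described above (the names, thresholds, and exact criteria here are made up for the sketch, not partis's actual code), a position only advances to the full fitting stage if it clears a series of cheap checks:

```python
import numpy as np

def passes_prefilters(bin_counts, bin_freqs, min_total=30, min_bin=8, min_slope=0.05):
    """Illustrative prefilter: bin_counts is the number of sequences in each
    N-mutation bin, bin_freqs is this position's mutation frequency per bin.
    All thresholds are hypothetical stand-ins for partis's real ones."""
    if sum(bin_counts) < min_total:              # not enough sequences overall
        return False
    if any(n < min_bin for n in bin_counts[:3]):  # low-mutation bins too sparse
        return False
    # crude approximate fit: flat positions aren't worth a full fit
    slope = np.polyfit(range(len(bin_freqs)), bin_freqs, 1)[0]
    return bool(abs(slope) > min_slope)

# a well-populated position with rising frequency passes:
print(passes_prefilters([50, 40, 30, 20], [0.02, 0.1, 0.2, 0.3]))
# a flat position is discarded before any expensive fitting:
print(passes_prefilters([50, 40, 30, 20], [0.1, 0.1, 0.1, 0.1]))
```

The point of gating on cheap checks first is that the full two-piece fits only ever run on a small surviving fraction of the hundreds of thousands of gene/position combinations.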

If it's actually missing a gene that is as solid as those ^ fits look, that sounds like a bug. While complicated, the procedure for deciding positions was tested very thoroughly, so something as egregious as that would be quite surprising (I certainly haven't seen it miss anything like that since very early versions in like 2016).

But is it definitely missing the genes resulting from those ^ inferences? For instance, most of the genes it infers are not "novel", i.e. not previously unknown -- most are re-inferences of genes that were removed in the allele-removal step. So I could imagine that a different partis version, which inevitably shuffles sequences a bit due to random seeds and clonal family collapse (not much has changed in the germline inference code in the last year), could for instance change which alleles get kept in the removal step, necessitating different inferred alleles. I.e., how different are the resulting germline sets?

psathyrella commented 5 years ago

I should add, there definitely is a way to track down in excruciating detail why a particular position was or was not passed through each individual step, since "tested very thoroughly" ^ means I spent an unholy amount of time doing exactly this on tons of different cases.

--debug-allele-finding, if you're not already using it, also provides a lot of extra information.

scharch commented 5 years ago

So both versions are using the same default v gene database, and both are keeping IGHV2-70*11 and IGHV4-31*03 in the pruning step.

For IGHV2-70, the old version calls only the novel allele, assigning 435 sequences, of which 54 are unmutated in V relative to the novel allele. The new version calls a different novel allele (IGHV2-70\*11+T102C.A178T) with only 9 sequences and assigns 440 to IGHV2-70\*11, of which 0 are unmutated in V. For IGHV4-31, the old version calls only the novel allele, assigning 812 sequences, of which 293 are unmutated in V. The new version does not call a novel allele and assigns 817 sequences, of which only 3 are unmutated relative to the default IGHV4-31\*03.

Other than these two genes, the two inferences are basically identical, although the new version has eliminated a few minor and/or presumably spurious novel alleles called by the old version (like IGHV4-39*07+C19T, which we previously discussed).

I have --debug-allele-finding on, but since partis is not even considering these positions, it hasn't been helpful to me. I'm attaching it here in case there's something I've missed. Then I'm going to go play with that debug function and see what I can figure out.

log.haplotyped.txt

scharch commented 5 years ago

OK, here's some additional weirdness: If I extract only VH4-31 reads (or even all VH4) and pass those to partis, even the new version finds IGHV4-31*03+A66G! The latter set also picks up a bunch of those low-frequency putative novel alleles that I mentioned, so maybe there is a frequency cut-off that's causing the problem? I think I have to redo the run on the whole input set...

For IGHV2-70, that ^ debug function didn't help, apparently because it's not even considering a 1-SNP potential novel allele? Log: log2-70.txt (NB: I know the debug function is working/on, because it does report on position 66 for the IGHV4-31 inference...)

psathyrella commented 5 years ago

First, just to temper expectations, remember that even in the easiest cases (left side), germline inference is screwing up a significant fraction of genes (figure from the paper):

[figure from the paper: fraction of genes germline inference gets wrong, easiest cases on the left]

Generally, when it's wrong, can I go in afterward and figure out why? Yeah, but that's usually much easier than designing it so it never makes that mistake again. The sample we're looking at now has extremely low mutation, which makes some parts of inference easier, since it's just a cleaner signal, but the fact that the SHM distribution is super spikey and almost discontinuous:

[plot: SHM distribution for this sample]

makes the two-piece linear model a worse approximation, making for a more challenging inference. (The spike on the right is just me wrapping overflow bins.)

So both versions are using the same default v gene database, and both are keeping IGHV2-70\*11 and IGHV4-31\*03 in the pruning step.

This doesn't sound like the issue here, but the default germline set has changed a bit in the last year; some genes of debatable functionality were removed.

warning trimming fwk insertion would take k_v min to less than zero for SRX2761071.149550: 40 - 104 = -64 note: maybe need to take reverse complement? (partis only searches in forward direction)

I feel like we discussed these warnings before, but I don't remember the conclusions. Looking at these sequences now, I would say they're pretty much garbage. About half are passing s-w, and half aren't; the ones that are passing look like this:

[annotation of one of the marginal sequences that passes s-w]

which is not necessarily directly related to germline inference, but as a philosophical matter (and in line with doing no error correction) partis tries as hard as it possibly can to find an annotation for every input sequence, and only fails them if it can't possibly find anything. Fwiw I never see this warning at all. As to the effect on germline inference, results are for sure going to be less stable if there's a good fraction of sequences that are adding a lot of noise, and in the last year the s-w treatment of marginal sequences has changed (it got smarter at resolving cases where the underlying s-w algorithm, ig-sw, called overlapping v/d or d/j matches).

although the new version has eliminated a few minor and/or presumably spurious novel alleles called by the old version (like IGHV4-39*07+C19T, which we previously discussed).

This changed as well: the min prevalence threshold is now also applied after germline inference.
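A minimal sketch of what applying a minimum-prevalence threshold after inference could look like (the function, dictionary shape, and cutoff value are hypothetical, chosen only to illustrate why low-count novel alleles like IGHV4-39\*07+C19T would get dropped):

```python
def apply_min_prevalence(counts, min_freq=0.005):
    """counts: {allele name: n assigned sequences} for one gene.
    Drop alleles whose within-gene frequency is below min_freq.
    min_freq is an illustrative value, not partis's actual threshold."""
    total = sum(counts.values())
    return {a: n for a, n in counts.items() if n / total >= min_freq}

# a 3-sequence novel allele out of ~1000 falls below the cutoff and is removed:
print(apply_min_prevalence({'IGHV4-39*07': 990, 'IGHV4-39*07+C19T': 3}))
```

This also fits the observation below that low-prevalence alleles reappear when you slice out a single gene's sequences: the same absolute counts make up a much larger within-repertoire fraction in the smaller sample.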

If I extract only VH4-31 reads (or even all VH4) and pass those to partis, even the new version finds IGHV4-31*03+A66G!

While to a first approximation genes live in their own separate subspaces, behavior is definitely not guaranteed to be identical when you pull out a subset of sequences. There are thresholds that will differ when the repertoire is smaller, the clonal collapse will be different, etc.

The latter set also picks up a bunch of those low-frequency putative novel alleles that I mentioned, so maybe there is a frequency cut-off that's causing the problem?

There are lots of frequency cutoffs, and generally they're designed to function in a full-repertoire environment, so it doesn't surprise me too much that low-prevalence alleles are more likely to show up when we slice out sequences from only one gene. More generally, one of the first things it would be nice to add is well-informed priors on germline set structure. It currently has no prior information on how many genes, how many alleles per gene, or the likelihood of particular genes it should expect. I think this was the correct decision when the method was finalized in spring 2017, but it's much less correct now; in particular I'm thinking of the Gidoni/Yaari Mosaic paper, which means we actually have some reasonable information that could inform a prior at this point.

For IGHV2-70, that ^ debug function didn't help, apparently because it's not even considering a 1 SNP potential novel alllele?

2-70 is likely unstable because it has multiple new alleles at overlapping positions. Here is one two-snp allele it's calling:

[plot: positional fits for this 2-snp candidate allele]

Now, one of the requirements for the N positional fits from an N-snp allele is that the fits are all pretty similar. This is expressed in a number of ways, but roughly, they have to have similar slopes and intercepts. These two fits are similar enough that the automated procedure apparently thinks they're ok. It has to account for significant deviations from the two-piece linear model caused by, for instance, super spikey SHM distributions, so it can't require the two fits to be all that similar.
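To make the consistency idea concrete, here is an illustrative version of such a check (the function and tolerances are my own stand-ins, not partis internals): the tolerances have to stay loose precisely because spikey SHM distributions distort individual fits.

```python
def fits_are_consistent(fits, slope_tol=0.15, icpt_tol=0.25):
    """fits: list of (slope, intercept) pairs, one per candidate position of an
    N-snp allele. Tolerances are hypothetical, deliberately loose values."""
    slopes = [s for s, _ in fits]
    icpts = [i for _, i in fits]
    return (max(slopes) - min(slopes) < slope_tol
            and max(icpts) - min(icpts) < icpt_tol)

# two near-flat fits whose intercepts differ by 0.15 still count as consistent:
print(fits_are_consistent([(0.02, 0.95), (0.01, 0.80)]))
# wildly different slopes would be rejected:
print(fits_are_consistent([(0.02, 0.95), (0.30, 0.20)]))
```

The looseness is exactly the failure mode described here: a homozygous-looking fit (slope ~0, intercept ~1.0) and a prevalence-0.8-looking fit can still land inside the tolerance window and get paired as one allele.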

By eye, though, it's pretty clear to me that the lefthand position is from a 2-snp allele that's homozygous (i.e. 2-70\*10 is not actually in the germline set -- slope zero, intercept 1.0), whereas the righthand position looks more like it's from a 2-snp allele with prevalence closer to 0.8. Generally, disentangling multiple alleles at overlapping positions is impossible to do well, at least unless you get lucky on how the positions and N-snp values overlap. Here, I think a possible cause is that the allele retained by the allele removal step was not the correct one, i.e. either:

I'm not sure that last bit is explained well enough, but anyway, if I'm saying there are two 2-snp new alleles, where are the other positions? One possibility is that the two alleles both have position 102 (so that plot is a superposition of them), and the second position is too close to either the 5p or 3p ends, so we miss it. It's also possible that the two alleles don't overlap at a position, and we just got super unlucky and both of them have one position too near the ends.

Setting the dbg positions stuff like so (don't need to set the length, since we know all the positions, so it can just take the length of dbg_positions):

dbg_positions = [1]

Yeah, you're right, it doesn't add anything, since position 1 is in the 5p excluded region (--debug-allele-finding output from your second log file):

[debug output showing position 1 falls inside the 5p exclusion]

This bit of the method is described in the "Excluded bases on 5’ and 3’ ends" section of the paper. In fairly clean data, there's usually not an exclusion on the 5p end of V as long as the reads all extend throughout V, so this is likely telling us that they don't.

Doing the same thing for position 66 in a file with just 4-31 genes gives:

[debug fit output for position 66]

The first line checks that the bin totals in the N-snpth and (N-snp - 1)th bins are very different; using its uncertainty model it thinks they're 20.9 sigma apart, which is greater than its threshold of 2.2, so it keeps going.

Next it prints the input bin values, then the fit values for first the one-piece and then the two-piece fit; their chi-squared ratio is something like 16 thousand, so it calls the new allele.
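The two checks above can be sketched on toy numbers (everything here is illustrative: the Poisson-style uncertainty model, the bin values, and the pass thresholds are my stand-ins, not partis's actual implementation):

```python
import numpy as np

def sigma_difference(total_a, total_b):
    # toy uncertainty model: Poisson errors on the two bin totals, combined
    err = np.sqrt(total_a + total_b)
    return abs(total_a - total_b) / err

def chi2(y, yfit, yerr):
    return float(np.sum(((y - yfit) / yerr) ** 2))

# toy per-bin mutation frequencies that jump at the candidate's N-snp bin
x = np.arange(6, dtype=float)
y = np.array([0.02, 0.04, 0.03, 0.90, 0.93, 0.95])
yerr = np.full_like(y, 0.02)

one_piece = np.polyval(np.polyfit(x, y, 1), x)   # single straight line
two_piece = np.concatenate([                     # separate line on each side
    np.polyval(np.polyfit(x[:3], y[:3], 1), x[:3]),
    np.polyval(np.polyfit(x[3:], y[3:], 1), x[3:]),
])
ratio = chi2(y, one_piece, yerr) / chi2(y, two_piece, yerr)

# first check: bin totals many sigma apart; second: two-piece fits far better
print(sigma_difference(500, 80) > 2.2, ratio > 100)
```

On data with a genuine new allele the two-piece model wins by orders of magnitude, which is why a chi-squared ratio in the thousands, as in the log above, is such a decisive call.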

Running on the whole file... uh, I actually get the same thing, modulo shuffling the numbers a bit:

[same debug output for the full file]

psathyrella commented 5 years ago

One possibility is that the two alleles both have position 102 (so that plot is a superposition of them), and the second position is too close to either the 5p or 3p ends, so we miss it.

I just realized that when I wrote that, I'd forgotten that your first plot at the top is probably one of these hypothetical missing positions. It's an N-snp 1 position, which is confusing, until I notice that it means it's actually 2-70\*01 which, looking at the allele removal step:

[allele removal step output]

only missed being the most common allele by a handful of counts. So, maybe both of my suggestions are happening -- unluckiness with shm leads to the wrong allele getting chosen as the most common, together with multiple alleles with overlapping positions.

scharch commented 5 years ago

Just to note, this is not the same data set as #281, even though I happen to be looking at IGHV2-70 for both. But the point is well taken nonetheless: these old 454 data sets break some of the assumptions that you are making and need to be filtered reasonably carefully before doing inference. Still wouldn't expect big shifts in results between partis versions, but I think we've managed to explain those changes, more or less.

psathyrella commented 5 years ago

Just to note, this is not the same data set as #281

Yes, I wasn't thinking it was the same. This data seems for sure clean enough for good inference, while that in #281 does not, not even close. My comment about garbage sequences above I maybe should have removed, since it's not super important, I was kind of live blogging as I looked through things. In this sample, the garbage sequences don't seem to be causing any real problems, they'd be more on the periphery, for instance they could very well be why one version excluded two bases on the 5p of V, while another didn't -- that decision is dependent on a fairly small number of sequences.

Still wouldn't expect big shifts in results between partis versions, but I think we've managed to explain those changes, more or less.

I think it's a matter of whether this is 'big' or not. On the one hand, I totally agree -- your first two plots at the top are, statistically, very big signals. But there are also large systematic uncertainties, stemming from how well the model mimics the data, and these manifest as things like missing a position because it was too close to the ends, or pairing the wrong positions together in a multiple-allele scenario. So in terms of the final germline set, I would not characterize a couple of different alleles as a big difference. More intuitively, I've gone through this exercise with, for instance, the samples in that green/red tree figure from the paper ^, and some missing/spurious alleles are from marginal statistical fits, but as many or more are from these kinds of systematic uncertainties/model screw-ups.