psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
55 stars 34 forks source link

stops in inferred naive sequence #199

Closed jotwin closed 6 years ago

jotwin commented 8 years ago

With my data, in the annotation output I found that ~5% of sequences labeled productive had an inferred naive sequence with a stop in the cdr3 region. Is this a bug or a feature?

> d.test2$seqs.aa[1:10]
 [1] "ASGYTFTNFGLSWVRQAPGQGLEWMGWISGHNGNTKYAQKVXXXXXXXTDTSTSTAYMELRSLRSDDTAVYYCARDVSGWGYSSIWGQGT"     
 [2] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLXXXXXXXXDTSTSTAYMELRSLRSDDTAVYYCAEYLQWNAEHFQFWGQGT"    
 [3] "ASGYTLNNYAISWVRQAPGQGLEWMGWFSIHNGNTHYAQKFXXXXXXXXXXSTGTAYMELRSLTYDDTAIYYCARPYSGNYLRHNLHHWGQGT"  
 [4] "PSGYTFPNYGISWVRQASGQGLEWMGWISAYNGNTNYAQKLXXXXXXXXXXSTSTAYMELRSLRSDDTAVYYCARASVLPRRAEYFQHWGQGT"  
 [5] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLXXXXXXXXXXSTSTAYMELRSLRSDDTAVYYCARASVLPRRAEYFQHWGQGT"  
 [6] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLXXXXXXXXXXSTSTAYMKLRYLTAADTALYYCASQSRRWLQLRFFQHWGQGT"  
 [7] "ASGYIFSSYYVYWVRQAPGQGLEWMGWISPYNGRTNYAQKLXXXXXXXXXXXTRTAYMELRSLRSDDTAVYYCARVPWGYSGSHWDFQHWGQGT" 
 [8] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLXXXXXXXXXXXTSTAYMELRSLRSDDTAVYYCARDRDSGSYKPQYFQHWGQGT" 
 [9] "ASGYTFTNYAITWVRQAPGQGLEWMGWVSGVNGNTNYAQTLXXXXXXXXXXXTDTAYMELSSLRSEDTAVYYCATDRSGQQLVLSYFQHWGQGT" 
[10] "ASGYKFRNYGITWVRQAPGQGPEWMGWISVFSGDTNYAPKFXXXXXXXXXXXXFTAYLEVRRLRADDTAVYYCAKDVTDDEDRPVEYFQNWGQGT"
> d.test2$naive_seq.aa[1:10]
 [1] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARDV*QWEYFQHWGQGT"     
 [2] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAEY*QWNAEYFQHWGQGT"    
 [3] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARPYSGSY*PEYFQHWGQGT"  
 [4] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARASV*QWRAEYFQHWGQGT"  
 [5] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARASV*QWRAEYFQHWGQGT"  
 [6] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCASQS*RWLQLRYFQHWGQGT"  
 [7] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVPWGYSGSY*YFQHWGQGT" 
 [8] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARDRDSGSY*PEYFQHWGQGT" 
 [9] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARDRV*QQLVPEYFQHWGQGT" 
[10] "ASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARDVL*LRLGEAEYFQHWGQGT"
psathyrella commented 8 years ago

huh, I guess I could argue either way.

On the one hand, I'm trying to report the functionality of the mature sequence, not the naive one. Maybe there was a stop codon in the naive sequence, or maybe I'm a few bases off from the true naive sequence. On the other hand, of course, the real naive sequence can't very well have been unproductive.

With typical/high-ish (10%) mutation, this figure from our annotation paper gives you a feel for how wrong our naive sequences are: perhaps three bases off would be a reasonable eye-average for the single-sequence partis line. p

Those three bases are almost always going to be within the cdr3. And in retrospect, stop codons being the (combinatorically, at least) freakishly prevalent things that they are, I wouldn't have been surprised if the number was higher than 5% (in other words, if you randomly mutate 3 bases in the cdr3, what fraction of sequences will end up with stop codons?).

It's worth noting, in the plot, how much better you do if you can switch to simultaneous inference on a few clonally-related sequences: as long as the mutations within a lineage are somewhat independent, everybody probably won't have all the same mutations, which gives you a huge handle for working out what the real naive seq was.

As far as what to do, we could simply forbid naive sequences to have stop codons, and go back and find the best annotation for which the naive sequence doesn't have a stop codon. But I'm reluctant to do this, because in general people want to be able to run on unproductive sequences, in fact to skip unproductive sequences you have to specify a command-line flag.

I'm open to suggestions otherwise, but for the moment, I'd recommend: if the mature sequence is productive, I would take an unproductive reported naive sequence as indication that the uncertainty on the offending bases is large. In the long term, we could report the most-likely naive sequences, like we do now with per-gene support for the most likely genes, but this isn't the sort of thing that comes for free, since the efficiency of the viterbi algorithm is predicated on not knowing the full probability of every path.

krdav commented 8 years ago

Looking at my annotations I, expectedly, have the same experience. Here is what I think though; unproductive immunoglobulin sequences are indeed possible and very frequent in the shm process and therefore relevant to keep in the partis analysis. On the other hand matured immunoglobulins can per definition never arise from an unproductive naive sequence and therefore it hardly makes any sense to infer this. Finding the best productive naive sequence makes most sense in my opinion and I think if you use this information to remove the possibility of unproductive naive seq. then that will also end up improving the performance plot you show above.

psathyrella commented 8 years ago

huh. yeah, that's a really good point. ok, so I'll split it like so:

Now when I say "productive" above, originally we were talking about stop codons, so I'll definitely require no stop codons. But I (and I think everyone else) in general also require unmutated and in-frame cdr3-bounding cysteine and tryptophan. So, I'll do that, too -- but I'm less sure than I used to be (mostly based on this discussion) that these codons can be 100% reliably identified. I just think in the long term we may all discover that there are expressed sequences with mutated cyst/tryp.

krdav commented 8 years ago

I just think in the long term we may all discover that there are expressed sequences with mutated cyst/tryp

Yes, I think that is an important thing to keep in mind.

scharch commented 8 years ago
  • If query sequence is unproductive and
    • unmutated, I'll allow unproductive naive sequences (uh... tautologically, I guess)
    • mutated, presumably shm demolished its productivity, so I'll require productive naive sequences

That latter is not guaranteed to be correct; in most cases it means assuming that the unproductive query sequence was biologically productive but was subjected to PCR or sequencing error.

True unproductive sequences from a productive lineage would be extremely rare to observe as those cells get cleared pretty rapidly. However, unproductive rearrangements in cells that successfully undergo productive rearrangement on the other chromosome do undergo SHM. So if you see an unproductive sequence and you are convinced (perhaps thanks to UMIs) that it is not in error, then the naive sequences probably was unproductive, as well.

For what it's worth, I do think what you propose is the best default behavior. One idea, though, is to leverage the power of inferring from multiple sequences. If the precursor was truly unproductive, then most/all of the query sequences should be, as well.

scharch commented 8 years ago

As far as defining (un)productive: Frame is, indeed, critical, but the Cys/[Trp|Phe] can definitely be mutated --we have monoclonals from the CAP256-VRC26 lineage with the light chain Phe of FGxG mutated to I (eg http://www.ncbi.nlm.nih.gov/nuccore/KJ134889.1) But frame shift should be definitively detectable nonetheless, by aligning to the whole germline J gene.

krdav commented 8 years ago

Thanks for this post @scharch

I had never even considered what you said about unproductive naive sequences being possible carry on luggage from a succesfull rearrangement. A question then arises. I had the impression that a successful rearrangemnet (in frame recombination with no stop codons) would trigger a signal for allelic exclusion of the other, possibly unproductive or not yet recombined chromosome. It seems from what you are writing that this is not entirely the case?

scharch commented 8 years ago

That's correct - B cells transcribe the Ig locus off both alleles (see, for instance: http://www.ncbi.nlm.nih.gov/pubmed/9075923). As to whether or not you can see the resulting mRNA in bulk NGS experiments, see http://b-t.cr/t/looking-for-bcr-datasets-containing-nonproductive-recombinations/115/6 -we don't typically see them, but Christian does. And of course if you are sequencing from DNA, then you will for sure see these sequences either way!

psathyrella commented 6 years ago

subsuming into #238